Re: Avoiding cartesian product

From: Szűcs Gábor <surrano(at)gmail(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Avoiding cartesian product
Date: 2006-01-09 15:59:16
Message-ID: 43C28854.9000605@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Dear Virag,

AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
indexes min and max, iirc).

Also, I don't think you need to exactly determine the trace_id. Try this one
(OTOH; might be wrong):

select DISTINCT ON (a.trace_id, a.seq_no) -- See below
b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
from jam_trace_sys a, jam_trace_sys b
where a.trace_id = 22
and b.trace_id = a.trace_id
and b.seq_no > a.seq_no -- Simply ">" is enough
order by a.trace_id, a.seq_no, b.seq_no; -- DISTINCT, see below

The trick is that DISTINCT takes the first one in each group (IIRC it is
granted, at least someone told me on one of these lists :) ) so if you order
by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
appropriate b.seq_no values for each DISTINCT values.

The idea of DISTINCTing by both columns is to make sure the planner finds
the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)

HTH,

--
G.

On 2006.01.04. 5:12, Virag Saksena wrote:
>
> I have a table which stores cumulative values
> I would like to display/chart the deltas between successive data
> collections
>
> If my primary key only increments by 1, I could write a simple query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and b.trace_id = a.trace_id
> and b.seq_no = a.seq_no + 1
> order by a.seq_no;
>
> However the difference in sequence number is variable.
> So (in Oracle) I used to extract the next seq_no using a correlated
> sub-query
>
> select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and (b.trace_id, b.seq_no) =
> (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
> order by a.seq_no;
>
> For every row in A, The correlated sub-query from C will execute
> With an appropriate index, it will just descend the index Btree
> go one row to the right and return that row (min > :value)
> and join to table B
>
> SELECT STATEMENT
> SORT ORDER BY
> TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
> NESTED LOOPS
> TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS A
> INDEX RANGE SCAN JAM_TRACE_SYS_N1 A
> INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
> SORT AGGREGATE
> INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
>
> In postgreSQL A and B are doing a cartesian product
> then C gets executed for every row in this cartesian product
> and most of the extra rows get thrown out.
> Is there any way to force an execution plan like above where the
> correlated subquery runs before going to B.
> The table is small right now, but it will grow to have millions of rows
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Sort (cost=124911.81..124944.84 rows=13213 width=20) (actual
> time=13096.754..13097.053 rows=149 loops=1)
> Sort Key: a.seq_no
> -> Nested Loop (cost=4.34..124007.40 rows=13213 width=20) (actual
> time=1948.300..13096.329 rows=149 loops=1)
> Join Filter: (subplan)
> -> Seq Scan on jam_trace_sys b (cost=0.00..3.75 rows=175
> width=16) (actual time=0.005..0.534 rows=175 loops=1)
> -> Materialize (cost=4.34..5.85 rows=151 width=16) (actual
> time=0.002..0.324 rows=150 loops=175)
> -> Seq Scan on jam_trace_sys a (cost=0.00..4.19
> rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
> Filter: (trace_id = 22)
> SubPlan
> -> Aggregate (cost=4.67..4.67 rows=1 width=4) (actual
> time=0.486..0.488 rows=1 loops=26250)
> -> Seq Scan on jam_trace_sys c (cost=0.00..4.62
> rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
> Filter: ((trace_id = $0) AND (seq_no > $1))
> Total runtime: 13097.557 ms
> (13 rows)
>
> pglnx01=> \d jam_trace_sys
> Table "public.jam_trace_sys"
> Column | Type | Modifiers
> -----------------+---------+-----------
> trace_id | integer |
> seq_no | integer |
> cpu_utilization | integer |
> gc_minor | integer |
> gc_major | integer |
> heap_used | integer |
> Indexes:
> "jam_trace_sys_n1" btree (trace_id, seq_no)
>
> pglnx01=> select count(*) from jam_trace_Sys ;
> count
> -------
> 175
> (1 row)
>
> pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
> trace_id | count
> ----------+-------
> 15 | 2
> 18 | 21
> 22 | 150
> 16 | 2
> (4 rows)

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alessandro Baretta 2006-01-09 16:23:06 Re: 500x speed-down: Wrong query plan?
Previous Message Hannu Krosing 2006-01-09 15:48:21 Re: Stats collector performance improvement