From:       Richard Huxton <dev(at)archonet(dot)com>
To:         Dave Crooke <dcrooke(at)gmail(dot)com>
Cc:         Garrett Murphy <gmurphy(at)lawlogix(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject:    Re: Extracting superlatives - SQL design philosophy
Date:       2010-02-25 08:10:37
Message-ID: 4B86307D.3000901@archonet.com
Lists:      pgsql-performance
On 24/02/10 23:37, Dave Crooke wrote:
> 1. The city temps table is a toy example, not meant to be realistic :-)
You knew that and I guessed it, but it's worth stating these things for
people who read the archives a year from now.
> 2. Yes, my (Java) algorithm is deterministic ... it will return
> exactly one row per city, and that will be the row (or strictly, *a*
> row) containing the highest temp. Temp value ties will break in favour
> of earlier rows in Guinness Book of Records tradition :-) It's
> equivalent to a HashAggregate implementation.
But it's not deterministic once you add in other columns (which is what you're trying to do): when two rows tie on the max, the extra columns can come back from either of them.
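If you need a fully repeatable result even with ties, the usual trick is to extend the sort with an explicit tiebreaker. A minimal sketch against the toy schema (the city_temps table and column names here are just illustrative; ctid works as a last-resort tiebreaker so long as the table isn't being rewritten underneath you):

  select distinct on (city) *
    from city_temps
   order by city, temp desc, ctid;  -- ctid breaks temp ties deterministically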
> The following two query plans (from my real schema) illustrate the
> itch I am trying to scratch .... I want the functionality of the 2nd
> one, but with the execution plan structure of the first:
>
> # explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
>                            QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------
>  HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual time=10861.845..10863.008 rows=1023 loops=1)
>    ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39 rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139 loops=1)
>  Total runtime: 10863.856 ms
> (3 rows)
>
> Time: 10864.817 ms
> # explain analyse select distinct on (a) * from perf_raw_2010_02_23
> order by a, b desc ;
One big part of the cost difference is going to be the ordering you need
to get a repeatable result.
>                            QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual time=46011.204..58428.210 rows=1023 loops=1)
>    ->  Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28) (actual time=46011.200..53561.112 rows=5276139 loops=1)
>          Sort Key: a, b
>          Sort Method:  external merge  Disk: 247584kB   -- actually OS RAM buffers
Even if the sort never actually reaches a physical disk, you should
still see an improvement from increasing work_mem (what used to be
called sort_mem) for the duration of the one query. It's not going to be
of the magnitude you want, but it's probably worth doing.
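For example, something like this (a sketch; the 1GB figure is purely illustrative, and SET LOCAL keeps the change scoped to the one transaction):

  begin;
  set local work_mem = '1GB';  -- size this to the sort shown in the plan
  select distinct on (a) * from perf_raw_2010_02_23 order by a, b desc;
  commit;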
>          ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39 rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139 loops=1)
>  Total runtime: 58516.185 ms
> (6 rows)
>
> Time: 58517.233 ms
>
> The only difference between these two is that the second query returns
> the whole row. The *ratio* in cost between these two plans increases
> in proportion to log(n) of the table size ... at 5.5m rows it's
> livable, at 500m it's probably not :-!
If performance on this query is vital to you, and the table doesn't
change after its initial population (which I'm guessing is true), then
try an index on (a asc, b desc) and CLUSTER the table on that index.
Depending on the ratio of distinct a values to total rows, that could be
what you're after.
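Something along these lines (the index name is mine, and both the per-column DESC and the CLUSTER ... USING syntax assume 8.3 or later):

  create index perf_raw_2010_02_23_a_bdesc_idx
      on perf_raw_2010_02_23 (a asc, b desc);
  cluster perf_raw_2010_02_23 using perf_raw_2010_02_23_a_bdesc_idx;
  analyse perf_raw_2010_02_23;  -- refresh stats after the rewrite

With the data laid out in (a, b desc) order, the DISTINCT ON query can be answered by an index scan and the big external sort disappears.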
--
Richard Huxton
Archonet Ltd