Re: Extracting superlatives - SQL design philosophy

From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Extracting superlatives - SQL design philosophy
Date: 2010-03-09 16:13:30
Message-ID: ca24673e1003090813tc0126e8v5be0866c67827199@mail.gmail.com
Lists: pgsql-performance

Cool trick .... I didn't realise you could do this at the SQL level without
a custom max() written in C.

What I ended up doing for my app is going with straight SQL: a SELECT
DISTINCT that generates the "key" tuples, plus a dependent subquery that does
a very small index scan to pull the data for each row (I care somewhat about
portability). To make this perform, I created a second index on the raw data
table with the columns composited in the order this rollup needs, which
allows PG to do a fairly efficient index range scan.

I had been trying to avoid spending the disk space on this 2nd index, since
it is only needed for the bulk rollup, but then I realised I only have to
keep it on the current day's partition, and I can drop it once that
partition's data has been aggregated (the insert overhead of the index isn't
as much of a concern).
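In sketch form, the index lifecycle looks like this (table and column names
here are hypothetical, not my actual schema):

```sql
-- Hypothetical names: raw_data_20100309 is the current day's partition;
-- (city, date, temp) is the column order the rollup scans in.
CREATE INDEX raw_data_20100309_rollup_idx
    ON raw_data_20100309 (city, date, temp);

-- ... run the bulk rollup against this partition ...

-- Once the partition's data has been aggregated, the index can go,
-- so we only ever pay its disk and insert cost for one day's data.
DROP INDEX raw_data_20100309_rollup_idx;
```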

Alternatively, I could have lived without the index by sharding the raw data
right down to the rollup intervals, which would make each rollup effectively
a full table scan anyway, but I didn't want to do that as it would make
real-time data extraction queries slower if they had to go across 10-20
tables.
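For reference, the portable shape of the rollup query is roughly the
following (again with hypothetical names; the real query carries more
columns, and LIMIT is not strictly standard SQL, though it's close enough
across the databases I care about):

```sql
-- Hypothetical: one row per "key" (city), pulling the newest reading via a
-- correlated subquery that the second index turns into a small range scan.
SELECT k.city,
       (SELECT r.temp
          FROM raw_data r
         WHERE r.city = k.city
         ORDER BY r.date DESC
         LIMIT 1) AS latest_temp
  FROM (SELECT DISTINCT city FROM raw_data) k
 ORDER BY k.city;
```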

Thanks everyone for the insights

Cheers
Dave

On Tue, Mar 9, 2010 at 6:46 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Wed, Feb 24, 2010 at 4:31 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
> > This is a generic SQL issue and not PG specific, but I'd like to get
> > an opinion from this list.
> >
> > Consider the following data:
> >
> > # \d bar
> >             Table "public.bar"
> >  Column |            Type             | Modifiers
> > --------+-----------------------------+-----------
> >  city   | character varying(255)      |
> >  temp   | integer                     |
> >  date   | timestamp without time zone |
> >
> > # select * from bar order by city, date;
> >    city    | temp |        date
> > -----------+------+---------------------
> >  Austin    |   75 | 2010-02-21 15:00:00
> >  Austin    |   35 | 2010-02-23 15:00:00
> >  Edinburgh |   42 | 2010-02-23 15:00:00
> >  New York  |   56 | 2010-02-23 15:00:00
> >  New York  |   78 | 2010-06-23 15:00:00
> > (5 rows)
> >
> > If you want the highest recorded temperature for a city, that's easy
> > to do, since the selection criteria works on the same column that we
> > are extracing:
> >
> > # select city, max(temp) from bar group by city order by 1;
> >    city    | max
> > -----------+-----
> >  Austin    |  75
> >  Edinburgh |  42
> >  New York  |  78
> > (3 rows)
> >
> >
> > However there is (AFAIK) no simple way in plain SQL to write a query
> > that performs such an aggregation where the aggregation criteria is on
> > one column and you want to return another, e.g. adding the *date of*
> > that highest temperature to the output above, or doing a query to
> > get the most recent temperature reading for each city.
> >
> > What I'd like to do is something like the below (and I'm inventing
> > mock syntax here, the following is not valid SQL):
> >
> > -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> > select city, temp, date from bar where date=max(date) group by city,
> > temp order by city;
> >
> > or perhaps
> >
> > -- More explicit
> > select aggregate_using(max(date), city, temp, date) from bar group by
> > city, temp order by city;
> >
> > Both of the above, if they existed, would be a single data access
> > followed by a sort-merge.
> >
> > The only way I know how to do it involves doing two accesses to the data,
> e.g.
> >
> > # select city, temp, date from bar a where date=(select max(b.date)
> > from bar b where a.city=b.city) order by 1;
> >    city    | temp |        date
> > -----------+------+---------------------
> >  Austin    |   35 | 2010-02-23 15:00:00
> >  Edinburgh |   42 | 2010-02-23 15:00:00
> >  New York  |   78 | 2010-06-23 15:00:00
> > (3 rows)
> >
> >
> > # explain select * from bar a where date=(select max(b.date) from bar
> > b where a.city=b.city) order by 1;
> >                                QUERY PLAN
> > --------------------------------------------------------------------------
> >  Sort  (cost=1658.86..1658.87 rows=1 width=528)
> >    Sort Key: a.city
> >    ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
> >          Filter: (date = (subplan))
> >          SubPlan
> >            ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
> >                  ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1 width=8)
> >                        Filter: (($0)::text = (city)::text)
> >                        -- would be an index lookup in a real scenario
>
> Another cool way to do this is via a custom aggregate:
> create function maxdata(data, data) returns data as
> $$
> select case when ($1).date > ($2).date then $1 else $2 end;
> $$ language sql;
>
> create aggregate maxdata(data)
> (
> sfunc=maxdata,
> stype=data
> );
>
> select (d).* from
> (
>   select maxdata(data) as d from data group by city
> ) q;
>
> It does it in a single pass. Where this approach can pay dividends is
> when you have a very complicated 'max'-ing criteria to justify the
> verbosity of creating the aggregate. If you are not doing the whole
> table, the self join is often faster. I'm surprised custom aggregates
> aren't used more...they seem very clean and neat to me.
>
> merlin
>
