Advice requested on structuring aggregation queries

From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Advice requested on structuring aggregation queries
Date: 2010-02-23 03:01:33
Message-ID: ca24673e1002221901s12c44a38t9a64fdde6b604cbb@mail.gmail.com
Lists: pgsql-performance

Hi folks

I have an application which collects performance stats at time intervals, to
which I am retro-fitting a table partitioning scheme in order to improve
scalability.

The original data is keyed by a 3-tuple of strings .... to keep the row
size down, in the new data model I'm actually storing 32-bit ints in
Postgres. The new schema for each table looks like this:

(a integer,
b integer,
c integer,
ts timestamp without time zone,
value double precision)

with two indexes: (a, b, ts) and (b, ts)
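
Spelled out as DDL it's roughly the following (the table and index names here
are illustrative - the real names are generated per partition):

create table data_table (
    a     integer,
    b     integer,
    c     integer,
    ts    timestamp without time zone,
    value double precision
);

create index data_table_a_b_ts_idx on data_table (a, b, ts);
create index data_table_b_ts_idx   on data_table (b, ts);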

If the table had a primary key, it would be (a, b, c, ts) but I don't need
one, and I don't want to carry column "c" in the index as it isn't necessary
... the number of distinct values of c for any (a,b) pair is small (less
than 10). Queries that read the data for graphing are of the form:

..... where a=<constant> and b=<constant> and ts between <x> and <y>

One of the things I need to do periodically is roll up this data, e.g.
calculate min / max / average values of a given datum over a period for use
in graphing at larger time scales. The rollup tables use the same schema as
the raw data ones.

There are about 60 different values of b, and for each such value there is
exactly one type of rollup. The old code does the rollups in Postgres with
60 bulk "insert into .... select" statements, hence the need for the second
index.

For rollups that are intrinsic SQL aggregate functions and depend only on the
value column, this is straightforward, e.g. to take the averages:

insert into rollup_table
select a, 17, c, '2009-02-22 19:00', avg(value)
from data_table
where b=17
and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
group by a, c;

So far so good.

This gets slightly more tricky if the rollup you want to do isn't a simple
SQL function .... I have two such cases, (a) the most recent value and (b)
the modal (most common) value. In both cases, I've been doing this using a
correlated subquery in the select clause, e.g. for the most recent value:

insert into rollup_table
select a, 23, c, '2009-02-23 00:00',
  (select y.value from
     (select z.value, z.ts from data_table z
      where z.a=x.a and z.b=23 and z.c=x.c) y
   order by y.ts desc limit 1)
from data_table x
where b=23
group by a, c;
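
The mode rollup uses the same shape of correlated subquery, roughly like the
following (the b value 29 is just a stand-in for one of the series that get
the modal rollup):

insert into rollup_table
select a, 29, c, '2009-02-23 00:00',
  (select y.value from data_table y
   where y.a=x.a and y.b=29 and y.c=x.c
   group by y.value
   order by count(*) desc limit 1)
from data_table x
where b=29
group by a, c;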

Due to the second index, this actually performs quite well, but it of course
depends on doing a small index scan on the same data_table for each row of
the outer select. I have a few questions:

*Q1.* Is this the right place for the subquery, or would it be better to
move it to the where clause, e.g.

insert into rollup_table
select a, 23, c, '2009-02-23 00:00', value
from data_table x
where b=23
  and ts = (select max(y.ts) from data_table y
            where y.a=x.a and y.b=23 and y.c=x.c);

or does the answer depend on row stats? Is there a rule of thumb on whether
to prefer "limit 1" vs. "max"? There will be between 20 and 300 rows with
different timestamps for each (a,b,c) tuple.

For better scalability, I am partitioning these tables by time .... I am not
using PG's table inheritance and triggers to do the partitioning, but
instead dynamically generating the SQL and table names in the application
code (Java). In most cases, the rollups will still happen from a single
source "data_table" and I plan to continue using the existing SQL, but I
have a few cases where the source "data_table" rows may actually come from
two adjacent tables.
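
For illustration, the per-period tables are just copies of the same schema,
created from the Java code with something like this (the daily naming and the
LIKE-template approach are only examples):

create table data_table_20100222 (like data_table including indexes);
create table data_table_20100223 (like data_table including indexes);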

For intrinsic SQL rollup functions like avg / min / max this is pretty easy:

insert into rollup_table
select a, 17, c, '... date ...', max(value) from (
  select ........ from data_table_1 where b=17 and .......
  union all
  select ........ from data_table_2 where b=17 and .......
) u
group by a, c;

*Q2.* Is there any benefit to doing rollups (max() and group by) in the
inner queries inside the UNION ALL, or is it a wash? For now I'm expecting
each inner select to produce 20-200 rows per unique (a,b,c) combination, and
to be unioning no more than 2 tables at a time. (I'm aware that max() and
min() are composable, but that doing this with avg() and getting correct
results involves using sums and counts in the subqueries and doing a
division in the outer select; AFAICT that's orthogonal to the
performance considerations of whether to do the inner rollups or not.)
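
For reference, the sum/count version of the avg rollup I have in mind looks
roughly like this (the dates and the b value are placeholders):

insert into rollup_table
select a, 17, c, '2009-02-22 19:00', sum(s) / sum(n)
from (
    select a, c, sum(value) as s, count(value)::float8 as n
    from data_table_1
    where b=17 and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
    group by a, c
  union all
    select a, c, sum(value) as s, count(value)::float8 as n
    from data_table_2
    where b=17 and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
    group by a, c
) u
group by a, c;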

*Q3.* (You saw it coming) What really has me wrapped around the axle is how
to do the "most recent" and "mode" rollups efficiently in this scenario ....
in the examples in Q1 above, in both cases the SQL references the
data_table twice, which is fine if it's a real table, but in the partitioned
scenario the data has to be pulled together at run time by the UNION ALL
subquery, and I'd prefer not to execute it multiple times - I'm not much of
a SQL wordsmith, but I can't see a way to write this in plain SQL without
repeating the subquery, e.g.

insert into rollup_table
select a, 23, c, '2009-02-23 00:00',
(select y.value from
(select z.value, z.ts from *(... union all ...)* z where z.a=x.a and
z.b=23 and z.c=x.c) y
order by y.ts desc limit 1)
from *(... union all ...)* x
group by a, c;

*a.* Is there a way to write this in plain SQL without syntactically
repeating the "union all" subquery, or in a way that will ensure PG only
runs that subquery once?

*b.* Is the PG query planner smart enough to figure out that the union all
subquery will return the same result every time (assuming
ISOLATION_SERIALIZABLE) and so only run it once?

*c.* Other ideas I had were:
  (i) temporarily create a VIEW which bridges the two tables, and use one
of the query structures from Q1 above (see the sketch after this list); or
  (ii) write the results of some kind of subquery to a temporary table, run
a query against that, then drop the temporary table; or
  (iii) pull the data back to the application layer and deal with it there;
or
  (iv) use a stored procedure, maybe?
Is there something better?
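
For (i), I mean something along these lines (the view name is made up):

create view data_table_bridge as
  select a, b, c, ts, value from data_table_1
  union all
  select a, b, c, ts, value from data_table_2;

insert into rollup_table
select a, 23, c, '2009-02-23 00:00', value
from data_table_bridge x
where b=23
  and ts = (select max(y.ts) from data_table_bridge y
            where y.a=x.a and y.b=23 and y.c=x.c);

drop view data_table_bridge;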

*Q4.* When partitioning tables for speed / performance reasons, what are
good rules of thumb to use to determine how big to make the partitions?

P.S. I realized while drafting this email that instead of running a bulk
"insert ... select" statement for each of the 60 values of b, I could group
them according to the rollup algorithm being used, and switch the second
index to being just (ts). Since the data is being laid down in time-series
order, the I/O pattern this creates on the table pages should look like a
sequential scan even though it's index driven.
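
To make that concrete, a rough sketch of what I mean (17 and 18 stand in for
whichever b values share the "average" rollup; the index name is made up):

create index data_table_ts_idx on data_table (ts);

insert into rollup_table
select a, b, c, '2009-02-22 19:00', avg(value)
from data_table
where b in (17, 18)
  and ts between '2009-02-22 18:00' and '2009-02-22 19:00'
group by a, b, c;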

All advice would be most gratefully appreciated!

Cheers
Dave
