Re: Combining Aggregates

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Combining Aggregates
Date: 2015-03-04 16:11:28
Message-ID: CA+TgmoZoeSFynwm-UsU+tHExVmyZNRJS_Zs3+gGQsnfdAkLvug@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 4, 2015 at 4:41 AM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>> This thread mentions "parallel queries" as a use case, but that means
>> passing data between processes, and that requires being able to
>> serialize and deserialize the aggregate state somehow. For actual data
>> types that's not overly difficult I guess (we can use in/out functions),
>> but what about aggregates using 'internal' state? I.e. aggregates
>> passing pointers that we can't serialize?
>
> This is a good question. I really don't know the answer to it as I've not
> looked at Robert's API for passing data between backends yet.
>
> Maybe Robert or Amit can answer this?

I think parallel aggregation will probably require both the
infrastructure discussed here, namely the ability to combine two
transition states into a single transition state, and also the ability
to serialize and de-serialize transition states, which has previously
been discussed in the context of letting hash aggregates spill to
disk. My current thinking is that the parallel plan will look
something like this:

HashAggregateFinish
-> Funnel
  -> HashAggregatePartial
    -> PartialHeapScan

So the workers will all pull from a partial heap scan and each worker
will aggregate its own portion of the data. Then it'll need to pass
the results of that step back to the master, which will aggregate the
partial results and produce the final results. I'm guessing that if
we're grouping on, say, column a, the HashAggregatePartial nodes will
kick out 2-column tuples of the form (<value for a>, <serialized
transition state for group with that value for a>).
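
To make the shape of that plan concrete, here is a rough Python sketch of the data flow for an avg()-style aggregate grouped on column a. This is illustration only, not PostgreSQL code: the function names, the (sum, count) transition state, and the struct-based byte format are all assumptions made up for the example, and the Funnel is reduced to simple list concatenation.

```python
import struct

# Hypothetical transition state for avg(): (running sum, row count).
EMPTY_STATE = (0.0, 0)

def transition(state, value):
    """Fold one input value into a transition state."""
    total, count = state
    return (total + value, count + 1)

def combine(left, right):
    """Merge two transition states into a single transition state."""
    return (left[0] + right[0], left[1] + right[1])

def serialize(state):
    """Flatten a state to bytes so it can cross a process boundary."""
    return struct.pack("<dq", state[0], state[1])

def deserialize(blob):
    """Rebuild a state from the bytes produced by serialize()."""
    total, count = struct.unpack("<dq", blob)
    return (total, count)

def finalize(state):
    """Turn the final transition state into the aggregate result."""
    total, count = state
    return total / count if count else None

def hash_aggregate_partial(rows):
    """Worker side: aggregate this worker's rows, emit (a, serialized state)."""
    groups = {}
    for a, value in rows:
        groups[a] = transition(groups.get(a, EMPTY_STATE), value)
    return [(a, serialize(state)) for a, state in groups.items()]

def hash_aggregate_finish(partial_tuples):
    """Master side: combine the per-worker states for each group, then finalize."""
    groups = {}
    for a, blob in partial_tuples:
        state = deserialize(blob)
        groups[a] = combine(groups[a], state) if a in groups else state
    return {a: finalize(state) for a, state in groups.items()}

# Two workers, each seeing a different portion of the heap.
worker1 = hash_aggregate_partial([("x", 1.0), ("y", 10.0), ("x", 3.0)])
worker2 = hash_aggregate_partial([("x", 5.0), ("y", 20.0)])

# Stand-in for the Funnel: gather the tuples from all workers.
result = hash_aggregate_finish(worker1 + worker2)
```

The point of the sketch is that the master never sees raw rows: it only needs a combine step and a deserialize step per aggregate. An aggregate whose state is just an in-memory pointer has nothing sensible to hand to serialize(), which is the problem raised in the quoted question.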

Of course this is all pie in the sky right now, but I think it's a
pretty good bet that partial aggregation is a useful thing to be able
to do. Postgres-XC put a bunch of work into this, too, so there are
lots of people saying "hey, we want that".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
