Re: Combining Aggregates

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2015-12-21 09:02:29
Message-ID: CAKJS1f9hGtS0c1Zc_uGDvNARDXbC4Siie873BJ74Bhu1-YxEtQ@mail.gmail.com
Lists: pgsql-hackers

On 18 December 2015 at 01:28, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:

> # select sum(x::numeric) from generate_series(1,3) x(x);
> ERROR: invalid memory alloc request size 18446744072025250716
>
> The reason that this happens is down to the executor always thinking that
> Aggref returns the aggtype, but in cases where we're not finalizing the
> aggregate the executor needs to know that we're actually returning
> aggtranstype instead. Hash aggregates appear to work as the trans value is
> just stuffed into a hash table, but with plain and sorted aggregates this
> gets formed into a Tuple again, and forming a tuple naturally requires the
> types to be set correctly! I'm not sure exactly what the best way to fix
> this will be. I've hacked something up in the attached test patch which
> gets around the problem by adding a new aggtranstype to Aggref and also an
> 'aggskipfinalize' field which I manually set to true in a bit of a hacky
> way inside the grouping planner. Then in exprType() for Aggref I
> conditionally return the aggtype or aggtranstype based on the
> aggskipfinalize setting. This is most likely not the way to properly fix
> this, but I'm running out of steam today to think of how it should be done,
> so I'm currently very open to ideas on this.
>

Ok, so it seems that my mindset was not in parallel process space when I
was thinking about this. I think having the pointer in the Tuple is
probably going to be fine for this multiple stage aggregation when that's
occurring in a single backend process, but obviously if the memory that the
pointer points to belongs to a worker process in a parallel aggregate
situation, then bad things will happen.

Now, there has been talk of this previously, on various threads, but I
don't believe any final decisions were made on how exactly it should be
done. At the moment I plan to make changes as follows:

1. Add 3 new columns to pg_aggregate: aggserialfn, aggdeserialfn and
aggserialtype. These will only be required when aggtranstype is INTERNAL.
Perhaps we should disallow CREATE AGGREGATE from accepting them for any
other type... The return type of aggserialfn should be aggserialtype, and
it should take a single parameter of aggtranstype. aggdeserialfn will be
the reverse of that.
2. Add a new bool field to nodeAgg's state named serialStates. If this
field is set to true then when we're in finalizeAgg = false mode, we'll
call the serialfn on the agg state instead of the finalfn. This will allow
the serialized state to be stored in the tuple and sent off to the main
backend. The combine agg node should also be set to serialStates = true,
so that it knows to deserialize instead of just assuming that the agg state
is of type aggtranstype.
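To make the rule in point 2 concrete, here's a minimal sketch of the decision it implies; the enum and function names are made up for illustration, not the actual executor code:

```c
#include <assert.h>
#include <stdbool.h>

/* What nodeAgg should do with the transition state when producing output. */
typedef enum
{
	CALL_FINALFN,		/* normal case: run the final function */
	CALL_SERIALFN,		/* partial stage with serialStates: serialize */
	RETURN_TRANSVALUE	/* partial stage without serialStates */
} AggOutputAction;

/* Hypothetical helper mirroring the proposed rule: when finalizing, run
 * the finalfn as usual; otherwise, with serialStates set, serialize the
 * transition state, and without it return the raw trans value. */
static AggOutputAction
agg_output_action(bool finalizeAggs, bool serialStates)
{
	if (finalizeAggs)
		return CALL_FINALFN;
	return serialStates ? CALL_SERIALFN : RETURN_TRANSVALUE;
}
```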

I believe this should allow us to avoid any performance regressions from
moving away from INTERNAL agg states. It should also be very efficient for
internal states such as Int8TransTypeData, as this struct merely has 2
int64 fields which should be very simple to stuff into a bytea varlena
type. We don't need to mess around converting the ->count and ->sum into
strings or anything.
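For illustration, a serialfn/deserialfn pair for a state like Int8TransTypeData could be as simple as the sketch below. This is an assumption-laden stand-in: a plain byte buffer takes the place of the real bytea varlena machinery, and the function names are invented:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for Int8TransTypeData: just two int64 fields. */
typedef struct Int8TransTypeData
{
	int64_t		count;
	int64_t		sum;
} Int8TransTypeData;

/* Hypothetical serialfn: flatten the state into a caller-supplied byte
 * buffer (standing in for a bytea). The two int64 fields are copied
 * as-is; no string conversion needed. Returns the number of bytes
 * written. */
static size_t
int8_serialize(const Int8TransTypeData *state, uint8_t *buf)
{
	memcpy(buf, &state->count, sizeof(int64_t));
	memcpy(buf + sizeof(int64_t), &state->sum, sizeof(int64_t));
	return 2 * sizeof(int64_t);
}

/* Hypothetical deserialfn: the exact reverse. */
static void
int8_deserialize(const uint8_t *buf, Int8TransTypeData *state)
{
	memcpy(&state->count, buf, sizeof(int64_t));
	memcpy(&state->sum, buf + sizeof(int64_t), sizeof(int64_t));
}
```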

Then in order for the planner to allow parallel aggregation all aggregates
must:

1. Not have a DISTINCT or ORDER BY clause
2. Have a combinefn
3. If aggtranstype = INTERNAL, must have an aggserialfn and aggdeserialfn.

We can relax requirement 3 if we're using 2-stage aggregation, but not
parallel aggregation.
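The three conditions (plus the relaxation for non-parallel 2-stage aggregation) could be checked along these lines; the struct fields here are made-up stand-ins for the catalog lookups the planner would actually do:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the per-aggregate facts the planner would consult. */
typedef struct AggInfo
{
	bool	has_distinct_or_order_by;	/* condition 1 */
	bool	has_combinefn;				/* condition 2 */
	bool	trans_is_internal;			/* aggtranstype == INTERNAL */
	bool	has_serialfns;				/* aggserialfn and aggdeserialfn set */
} AggInfo;

/* Returns true if the aggregate qualifies under the proposed rules.
 * For 2-stage (but non-parallel) aggregation, pass parallel = false and
 * condition 3 is relaxed, since no state crosses a process boundary. */
static bool
agg_multistage_ok(const AggInfo *agg, bool parallel)
{
	if (agg->has_distinct_or_order_by)
		return false;
	if (!agg->has_combinefn)
		return false;
	if (parallel && agg->trans_is_internal && !agg->has_serialfns)
		return false;
	return true;
}
```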

Any objections, or better ideas?

That just leaves me to figure out how to set the correct Aggref->aggtype
during planning, as now there's 3 possible types:

if (finalizeAggs == false)
{
	if (serialStates == true)
		type = aggserialtype;
	else
		type = aggtranstype;
}
else
	type = prorettype;	/* normal case */

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
