Re: Combining Aggregates

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2016-03-20 21:55:19
Message-ID: 4c881a76-9682-6871-ddaa-4f58e0fd25b5@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 03/20/2016 04:48 AM, David Rowley wrote:
> On 17 March 2016 at 14:25, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 03/16/2016 12:03 PM, David Rowley wrote:
>>> Patch 2:
>>> This adds the serial/deserial aggregate infrastructure, pg_dump
>>> support, CREATE AGGREGATE changes, and nodeAgg.c changes to have it
>>> serialise and deserialise aggregate states when instructed to do so.
>>>
>>> Patch 3:
>>> This adds a boat load of serial/deserial functions, and combine
>>> functions for most of the built-in numerical aggregate functions. It
>>> also contains some regression tests which should really be in patch 2,
>>> but I with patch 2 there's no suitable serialisation or
>>> de-serialisation functions to test CREATE AGGREGATE with. I think
>>> having them here is ok, as patch 2 is quite useless without patch 3
>>> anyway.
>>
>> I don't see how you could move the tests into #2 when the functions are
>> defined in #3? IMHO this is the right place for the regression tests.
>
> Yeah, but the CREATE AGGREGATE changes which are being tested in 0003
> were actually added in 0002. I think 0002 is the right place to test
> the changes to CREATE AGGREGATE, but since there's a complete lack of
> functions to use, then I've just delayed until 0003.
>
>>> Another thing to note about this patch is that I've gone and created
>>> serial/de-serial functions for when PolyNumAggState both require
>>> sumX2, and don't require sumX2. I had thought about perhaps putting an
>>> extra byte in the serial format to indicate if a sumX2 is included,
>>> but I ended up not doing it this way. I don't really want these serial
>>> formats getting too complex as we might like to do fun things like
>>> pass them along to sharded servers one day, so it might be nice to
>>> keep them simple.
>>
>>
>> Hmmm, I've noticed that while eyeballing the diffs, and I'm not
>> sure if it's worth the additional complexity at this point. I mean,
>> one byte is hardly going to make a measurable difference - we're
>> probably wasting more than that due to alignment, for example.
>
> I don't think any alignment gets done here. Look at pq_sendint(). Did
> you mean the complexity of having extra functions, or having the
> extra byte to check in the deserial func?

I haven't realized alignment does not apply here, but I still think the
extra byte would be negligible in the bigger scheme of things. For
example we're serializing the state into a bytea, which adds up to 4B
header. So I don't think adding 1B would make any measurable difference.

But that assumes adding the 1B header would actually make the code
simpler. Now that I think about it, that may not the case - in the end
you'd probably get a function with two branches, with about the same
size as the two functions combined. So not really an improvement.

>> As you've mentioned sharded servers - how stable should the
>> serialized format be? I've assumed it to be transient, i.e. in the
>> extreme case it might change after restarting a database - for the
>> parallel aggregate that's clearly sufficient.
>>
>> But if we expect this to eventually work in a sharded environment,
>> that's going to be way more complicated. I guess in these cases we
>> could rely on implicit format versioning, via the major the
>> version (doubt we'd tweak the format in a minor version anyway).
>>
>> I'm not sure the sharding is particularly useful argument at this
>> point, considering we don't really know if the current format is
>> actually a good match for that.
>
> Perhaps you're right. At this stage I've no idea if we'd want to
> support shards on varying major versions. I think probably not,
> since the node write functions might not be compatible with the node
> read functions on the other server.

OK, so let's ignore the sharded setup for now.

> I've attached another series of patches:
>
> 0001: This is the latest Parallel Aggregate Patch, not intended for
> review here, but is required for the remaining patches. This patch has
> changed quite a bit from the previous one that I posted here, and the
> remaining patches needed re-based due to those changes.
>
> 0002: Adds serial/de-serial function support to CREATE AGGREGATE,
> contains minor fix-ups from last version.
>
> 0003: Adds various combine/serial/de-serial functions for the standard
> set of aggregates, apart from most float8 aggregates.
>
> 0004: Adds regression tests for 0003 pg_aggregate.h changes.
>
> 0005: Documents to mention which aggregate functions support partial mode.
>
> 0006: Re-based version of Haribabu's floating point aggregate support,
> containing some fixes by me.

I went through the changes and found nothing suspicious, except maybe
for the wording in one of the doc patches:

All types apart from floating-point types

It may not be entirely clear for the readers, but this does not include
"numeric" data type, only float4 and float8. But I don't think this
really matters unless we fail to commit the last patch adding functions
even for those data types.

Also, I think it's probably the right time to fix the failures in
opr_sanity regression tests. After all, we'll only whitelist a bunch of
serialize functions, so the tests will still detect any unexpected
functions dealing with 'internal' data type.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-03-20 21:56:30 Re: extend pgbench expressions with functions
Previous Message Thomas Munro 2016-03-20 21:46:12 Re: Performance degradation in commit ac1d794