The Future of Aggregation

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, kevin(dot)grittner(at)enterprisedb(dot)com, amit(dot)kapila(at)enterprisedb(dot)com, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Subject: The Future of Aggregation
Date: 2015-06-09 10:58:59
Message-ID: CAKJS1f-TmWi-4c5K6CBLRdTfGsVxOJhadefzjE7SWuVBgMSkXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

It appears to me that there's quite a few new features and optimisations on
the not too distant horizon which will require adding yet more fields into
pg_aggregate.

These are things along the lines of:

1. Parallel Aggregation (computes each aggregate state in parallel worker
processes and then merges these states in serial mode)
2. Aggregate push-down / Aggregate before join (requires passing partially
computed aggregate states between executor nodes)
3. Auto-updating Materialized views (ones which contain aggregate functions)
4. Foreign table aggregation
5. Dependant Aggregates (I talked about earlier here ->
http://www.postgresql.org/message-id/CAKJS1f8ebkc=EhEq+ArM8vwYZ5vSapJ1Seub5=FvRRuDCtFfsQ@mail.gmail.com
)

Items 1-4 above I believe require support of "Aggregate State Combine
Support" -> https://commitfest.postgresql.org/5/131/ which I believe will
need to be modified to implement complex database types to backup our
internal aggregate state types so that these types be properly passed
between executor nodes, between worker processes and perhaps foreign data
wrappers (maybe just postgres_fdw I've not looked into this yet)

Item 5 makes items 1-4 a bit more complex as with this item there's
opportunity for very good performance improvements by allowing aggregates
like AVG(x) also perform all the required work to allow SUM(x) and COUNT(x)
to be calculated for "free" in a query containing all 3 aggregates.

I've discussed item 5 off-list with Simon and he mentioned that we might
invent a transition state and transition functions which can have parts
switched on and off much like how calcSumX2 controls if do_numeric_accum()
should calculate sum(x*x) or not. The problem with this is that if we ever
want to store aggregate states in an auto-updating materialized view, then
this generic aggregate state will have to contain at least 3 fields (to
store count(x), sum(x) and sum(x*x)), and those 3 fields would have to be
stored even if the aggregate was just a simple count(*).

The idea I discussed in the link in item 5 above gets around this problem,
but it's a perhaps more surprise filled implementation as it will mean
"select avg(x),sum(x),count(x) from t" is actually faster than "select
sum(x),count(x) from t" as the agg state for avg() will satisfy sum and
count too.

The purpose of this email is to open the doors for discussion about this so
that nobody goes off and develops feature X into a corner and disallows
feature Y and so that we end up with the most optimal solution that does
not trip us up in the future.

I'm interested to hear if Kevin or Amit have had any time to give this any
thought before. It would be good to ensure we all have the same vision here.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Chalke 2015-06-09 11:54:28 Re: bugfix: incomplete implementation of errhidecontext
Previous Message Jeevan Chalke 2015-06-09 09:59:38 Dead code in Create/RenameRole() after RoleSpec changes related to CURRENT/SESSION_USER