Re: using custom scan nodes to prototype parallel sequential scan

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: using custom scan nodes to prototype parallel sequential scan
Date: 2014-11-14 00:55:48
Message-ID: CAApHDvp1uS6sTFOWysM48Z1paziJeTaXUZsQeJ4dPMLFuv76sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 12 November 2014 07:54, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> > On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
> >>
> >>
> >> This plan type is widely used in reporting queries, so will hit the
> >> mainline of BI applications and many Mat View creations.
> >> This will allow SELECT count(*) FROM foo to go faster also.
> >>
> >
> > We'd also need to add some infrastructure to merge aggregate states
> together
> > for this to work properly. This means that could also work for avg() and
> > stddev etc. For max() and min() the merge functions would likely just be
> the
> > same as the transition functions.
>
> Do you mean something like a "subtotal" or "intermediate combination"
> functon?
>
>
If you had 4 parallel workers performing a seqscan, say the relation was
4000 pages in total, you could say that worker 1 would scan blocks 0-999,
worker 2, 1000-1999 etc. After each worker had finished, there would then
be 4 sets of records then needed to be merged into 1 set.

Take int4_avg_accum() for example it does:

transdata->count++;
transdata->sum += newval;

The merge function would need to perform something like:

transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.

The same can be applied when the query contains a GROUP BY clause, just
we'd need pay attention to which groups we merge together for that to work
Any HAVING clause would have to be applied after the groups have been
merged.

This whole topic is pretty exciting for data warehouse type workloads.

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2014-11-14 00:57:16 Re: Using 128-bit integers for sum, avg and statistics aggregates
Previous Message Tom Lane 2014-11-14 00:53:56 Re: EXPLAIN ANALYZE output weird for Top-N Sort