Quick Links

Re: Parallel grouping sets

From:	Richard Guo <riguo(at)pivotal(dot)io>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Parallel grouping sets
Date:	2019-07-31 08:06:30
Message-ID:	CAN_9JTyPnkHuZ8ruWCwW-QiCefntONaE+E0w7H63hw2ywyh-4w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo(at)pivotal(dot)io> wrote:
> >
> >> Hi all,
> >>
> >> Paul and I have been hacking recently to implement parallel grouping
> >> sets, and here we have two implementations.
> >>
> >> Implementation 1
> >> ================
> >>
> >> Attached is the patch and also there is a github branch [1] for this
> >> work.
> >>
> >
> >Rebased with the latest master.
> >
>
> Hi Richard,
>
> thanks for the rebased patch. I think the patch is mostly fine (at least I
> don't see any serious issues). A couple minor comments:
>

Hi Tomas,

Thank you for reviewing this patch.

>
> 1) I think get_number_of_groups() would deserve a short explanation why
> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
> same branch. Before these cases were clearly separated, now it seems a bit
> mixed up and it may not be immediately obvious why it's OK.
>

Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.

>
> 2) There are new regression tests, but they are not added to any schedule
> (parallel or serial), and so are not executed as part of "make check". I
> suppose this is a mistake.
>

Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.

>
> 3) The regression tests do check plan and results like this:
>
> EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
> SELECT ... ORDER BY 1, 2, 3;
>
> which however means that the query might easily use a different plan than
> what's verified in the eplain (thanks to the additional ORDER BY clause).
> So I think this should explain and execute the same query.
>
> (In this case the plans seems to be the same, but that may easily change
> in the future, and we could miss it here, failing to verify the results.)
>

Thank you for pointing this out. Fixed it in V4 patch.

>
> 4) It might be a good idea to check the negative case too, i.e. a query on
> data set that we should not parallelize (because the number of partial
> groups would be too high).
>

Yes, agree. Added a negative case.

>
>
> Do you have any plans to hack on the second approach too? AFAICS those two
> approaches are complementary (address different data sets / queries), and
> it would be nice to have both. One of the things I've been wondering is if
> we need to invent gset_id as a new concept, or if we could simply use the
> existing GROUPING() function - that uniquely identifies the grouping set.
>
>
Yes, I'm planning to hack on the second approach in short future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach. I agree with you that we can try GROUPING()
function to see if it can replace gset_id.

Thanks
Richard

Attachment	Content-Type	Size
v4-0001-Implementing-parallel-grouping-sets.patch	application/octet-stream	24.6 KB

In response to

Re: Parallel grouping sets at 2019-07-30 15:05:30 from Tomas Vondra

Responses

Re: Parallel grouping sets at 2019-09-30 09:41:23 from Pengzhou Tang

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Langote	2019-07-31 08:17:09	Re: Problem with default partition pruning
Previous Message	Amit Langote	2019-07-31 08:04:38	Re: partition routing layering in nodeModifyTable.c