Re: [HACKERS] Partition-wise aggregation/grouping

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Partition-wise aggregation/grouping
Date: 2018-03-19 17:45:49
Message-ID: CA+TgmoaR+AGqs=RLzp92_qWaRpHdJRbyqTvQN9SAL2qRrc0Mrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> This patch also renames can_parallel_agg to
>> can_partial_agg and removes the parallelism-specific bits from it.
>
> I think we need to update the comments in this function to use phrase
> "partial aggregation" instead of "parallel aggregation". And I think
> we need to change the conditions as well. For example if
> parse->groupClause == NIL, why can't we do partial aggregation? This
> is the classical case when we will need patial aggregation. Probably
> we should test this with Jeevan's patches for partition-wise aggregate
> to see if it considers partition-wise aggregate or not.

I think the case where we have neither any aggregates nor a grouping
clause is where we are doing SELECT DISTINCT. Something like SELECT
COUNT(*) FROM ... is not this case because that has an aggregate.

I'm sort of on the fence as to whether and how to update the comments.
I agree that it's a little strange to leave the comments here
referencing parallel aggregation when the function has been renamed to
is_partial_agg(), but a simple search-and-replace doesn't necessarily
improve the situation very much. Most notably, hasNonSerial is
irrelevant for partial but non-parallel aggregation, but we still have
the test because we haven't done the work to really do the right thing
here, which is to separately track whether we can do parallel partial
aggregation (either hasNonPartial or hasNonSerial is a problem) and
non-parallel partial aggregation (only hasNonPartial is a problem).
This needs a deeper reassessment, but I don't think that can or should
be something we try to do right now.

> OR When parse->groupingSets is true, I can see why we can't use
> parallel query, but we can still compute partial aggregates. This
> condition doesn't hurt since partition-wise aggregation bails out when
> there are grouping sets, so it's not that harmful here.

I haven't thought deeply about what will break when GROUPING SETS are
in use, but it's not the purpose of this patch set to make them work
where they didn't before. The point of hoisting the first two tests
out of this function was just to avoid doing repeated work when
partition-wise aggregate is in use. Those two tests could conceivably
produce different results for different children, whereas the
remaining tests won't give different answers. Let's not get
distracted by the prospect of improving the tests. I suspect that's
not anyway so simple to achieve as what you seem to be speculating
here...

>> I am sort of unclear whether we need/want GroupPathExtraData at all.
>> What determines whether something gets passed via GroupPathExtraData
>> or just as a separate argument? If we have a rule that stuff that is
>> common to all child grouped rels goes in there and other stuff
>> doesn't, or stuff postgres_fdw needs goes in there and other stuff
>> doesn't, then that might be OK. But I'm not sure that there is such a
>> rule in the v20 patches.
>
> We have a single FDW hook for all the upper relations and that hook
> can not accept grouping specific arguments. Either we need a separate
> FDW hook for grouping OR we need some way of passing upper relation
> specific information down to an FDW. I think some FDWs and extensions
> will be happy if we provide them readymade decisions for can_sort,
> can_hash, can_partial_agg etc. It will be good if they don't have to
> translate the grouping target and havingQual for every child twice,
> once for core and second time in the FDW. In all it looks like we need
> some structure to hold that information so that we can pass it down
> the hook. I am fine with two structures one variable and other
> invariable. An upper operation can have one of them or both.

I'm fine with using a structure to bundle details that need to be sent
to the FDW, but why should the FDW need to know about
can_sort/can_hash? I suppose if it needs to know about
can_partial_agg then it doesn't really cost anything to pass through
all the flags, but I doubt whether the FDW has any use for the others.
Anyway, if that's the goal, let's just make sure that the set of
things we're passing that way are exactly the set of things that we
think the FDW might need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2018-03-19 17:46:16 Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
Previous Message Robert Haas 2018-03-19 17:26:18 Re: [HACKERS] Partition-wise aggregation/grouping