Re: WIP: Aggregation push-down

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Aggregation push-down
Date: 2017-09-11 08:30:40
Message-ID: CAFjFpRfUKq3Asgtki1XctPkCN6YC4oA2vNWh66OgBo-H26ePWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 8, 2017 at 7:04 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
>> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
>> > Antonin Houska <ah(at)cybertec(dot)at> wrote:
>> >
>> >> Antonin Houska <ah(at)cybertec(dot)at> wrote:
>> >>
>> >> > This is a new version of the patch I presented in [1].
>> >>
>> >> Rebased.
>> >>
>> >> cat .git/refs/heads/master
>> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
>> >
>> > This is another version of the patch.
>> >
>> > Besides other changes, it enables the aggregation push-down for postgres_fdw,
>> > although only for aggregates whose transient aggregation state is equal to the
>> > output type. For other aggregates (like avg()) the remote nodes will have to
>> > return the transient state value in an appropriate form (maybe bytea type),
>> > which does not depend on PG version.
>>
>> Having transient aggregation state type same as output type doesn't
>> mean that we can feed the output of remote aggregation as is to the
>> finalization stage locally or finalization is not needed at all. I am
>> not sure why is that test being used to decide whether or not to push
>> an aggregate (I assume they are partial aggregates) to the foreign
>> server.
>
> I agree. This seems to be the same problem as reported in [2].
>
>> postgres_fdw doesn't have a feature to fetch partial aggregate
>> results from the foreign server. Till we do that, I don't think this
>> patch can push any partial aggregates down to the foreign server.
>
> Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
> does not exist and the transient type can be transfered from the remote server
> in textual form? (Of course there's a question is if such behaviour is
> consistent from user's perspective.)

Yes, those functions will work.

>
>> >
>> > One thing I'm not sure about is whether the patch should remove
>> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>> > obsolete. Or should it only be deprecated so far? I understand that
>> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
>> > interface for extension developers, so the policy might be different.
>>
>> I doubt if that's correct. We can do that only when we know that all
>> kinds aggregates/grouping can be pushed down the join tree, which
>> looks impossible to me. Apart from that that FdwRoutine handles all
>> kinds of upper relations like windowing, distinct, ordered which are
>> not pushed down by this set of patches.
>
> Good point.
>

I think this is where Jeevan Chalke's partition-wise aggregation
helps. It deals with the cases where your patch can not push the
aggregates down to children of join.

>> The patch maintains an extra rel target, two pathlists, partial pathlist and
>> non-partial pathlist for grouping in RelOptInfo. These two extra
>> pathlists require "grouped" argument to be passed to a number of functions.
>> Instead probably it makes sense to create a RelOptInfo for aggregation/grouping
>> and set pathlist and partial_pathlist in that RelOptInfo. That will also make a
>> place for keeping grouped estimates.
>
> If grouped paths require a separate RelOptInfo, why the existing partial paths
> do not?

partial paths produce the same targetlist and the same relation that
non-partial paths do. A RelOptInfo in planner represents a set of rows
and all paths added to that RelOptInfo produce same whole result
(parameterized paths produces the same result collectively). Grouping
paths however are producing a different result and different
targetlist, so may be it's better to have a separate RelOptInfo.

>
>> For placeholders we have two function add_placeholders_to_base_rels() and
>> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but do
>> not have corresponding function for joinrels. How do we push aggregates
>> involving two or more relations to the corresponding joinrels?
>
> add_grouping_info_to_base_rels() calls prepare_rel_for_grouping() which
> actually adds the "grouping info". For join, prepare_rel_for_grouping() is
> called from build_join_rel().

Ok.

>
> While PlaceHolderVars need to be finalized before planner starts to create
> joins, the GroupedPathInfo has to fit particular pair of joined relations.
>
> Perhaps create_grouping_info_... would be better, to indicate that the thing
> we're adding does not exist yet. I'll think about it.
>
>> Some minor assorted comments ...
>
> o.k., will fix.
>
>> Some quick observations using two tables t1(a int, b int) and t2(a int, b int),
>> populated as
>> insert into t1 select i, i % 5 from generate_series(1, 100) i where i % 2 = 0;
>> insert into t2 select i, i % 5 from generate_series(1, 100) i where i % 3 = 0;
>>
>> 1. The patch pushes aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 inner join t2 on t1.b = t2.b group by
>> t2.b;
>> but does not push the aggregates down join in the following query
>> explain verbose select avg(t2.a) from t1 left join t2 on t1.b = t2.b group by
>> t2.b;
>
>> In fact, it doesn't use the optimization for any OUTER join. I think the reason
>> for this behaviour is that the patch uses equivalence classes to distribute the
>> aggregates and grouping to base relations and OUTER equi-joins do not form
>> equivalence classes. But I think it should be possible to push the aggregates
>> down the OUTER join by adding one row for NULL values if the grouping is pushed
>> to the inner side. I don't see much change for outer side. This also means that
>> we have to choose some means other than equivalence class for propagating the
>> aggregates.
>
> The problem is that aggregate pushed down to the nullable side of an outer
> join receives different input values than the original aggregate at the top
> level of the query. NULL values generated by the OJ make the difference. I
> have no idea how to handle this problem. If the aggregation is performed on
> the nullable side of the OJ, it can't predict which of the input values don't
> match any value of the other side. Suggestions are appreciated.

I haven't thought through this fully as well, but I think this can be
fixed by using some kind of all NULL row on the nullable side. If we
are going to use equivalence classes, we won't be able to extend that
mechanism to OUTER joins, so may be you want to rethink about using
equivalence classes as a method of propagating grouping information.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2017-09-11 08:31:36 Re: mysql_fdw + PG10: unrecognized node type: 217
Previous Message Ashutosh Bapat 2017-09-11 08:17:22 Re: [POC] hash partitioning