Re: PoC: Grouped base relation

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC: Grouped base relation
Date: 2017-01-18 04:33:35
Message-ID: CAFjFpRfqp4hM3Dnpo=C-+SvT9Zogc-qFzmDR=VqmCEW2Q1sGPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 17, 2017 at 10:07 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> [... snip ]]
>>
>> This all works well, as long as the aggregate is "summing" something
>> across rows. The method doesn't work when aggregation is say
>> "multiplying" across the rows or "concatenating" across the rows like
>> array_agg() or string_agg(). They need a different strategy to combine
>> aggregates across relations.
>
> Good point. The common characteristic of these seems to be that thay don't
> have aggcombinefn defined.

I don't think aggcombinefn isn't there because we couldn't write it
for array_agg() or string_agg(). I guess, it won't be efficient to
have those aggregates combined across parallel workers.

Also, the point is naming that kind of function as aggtransmultifn
would mean that it's always supposed to multiply, which isn't true for
all aggregates.

>
>> IIUC, we are trying to solve multiple problems here:
>
>> 1. Pushing down aggregates/groups down join tree, so that the number of rows
>> to be joined decreases. This might be a good optimization to have. However
>> there are problems in the current patch. Every path built for a relation
>> (join or base) returns the same result expressed by the relation or its
>> subset restricted by parameterization or unification. But this patch changes
>> that. It creates paths which represent grouping in the base relation. I
>> think, we need a separate relation to represent that result and hold paths
>> which produce that result. That itself would be a sizable patch.
>
> Whether a separate relation (RelOptInfo) should be created for grouped
> relation is an important design decision indeed. More important than your
> argument about the same result ("partial path", used to implement parallel
> nodes actually does not fit this criterion perfectly - it only returns part of
> the set) is the fact that the data type (target) differs.

Right!

>
> I even spent some time coding a prototype where separate RelOptInfo is created
> for the grouped relation but it was much more invasive. In particular, if only
> some relations are grouped, it's hard to join them with non-grouped ones w/o
> changing make_rel_from_joinlist and subroutines substantially. (Decision
> whether the plain or the grouped relation should be involved in joining makes
> little sense at the leaf level of the join tree.)
>
> So I took the approach that resembles the partial paths - separate pathlists
> within the same RelOptInfo.

Yes, it's hard, but I think without having a separate RelOptInfo the
design won't be complete. Is there a subset of problem that can be
solved by using a separate RelOptInfo e.g. pushing aggregates down
child relations or anything else.

>
>> 2. Try to push down aggregates based on the equivalence classes, where
>> grouping properties can be transferred from one relation to the other using
>> EC mechanism.
>
> I don't think the EC part should increase the patch complexity a lot. Unless I
> missed something, it's rather isolated to the part where target of the grouped
> paths is assembled. And I think it's important even for initial version of the
> patch.
>
>> This seems to require solving the problem of combining aggregates across the
>> relations. But there might be some usecases which could benefit without
>> solving this problem.
>
> If "combining aggregates ..." refers to joining grouped relations, then I
> insist on doing this in the initial version of the new feature too. Otherwise
> it'd only work if exactly one base relation of the query is grouped.

No. "combining aggregates" refers to what aggtransmultifn does. But,
possibly that problem needs to be solved in the first step itself.

>
>> 3. If the relation to which we push the aggregate is an append relation,
>> push (partial) aggregation/grouping down into the child relations. - We
>> don't do that right now even for grouping aggregation on a single append
>> table. Parallel partial aggregation does that, but not exactly per
>> relation. That may be a sizable project in itself. Even without this piece
>> the rest of the optimizations proposed by this patch are important.
>
> Yes, this can be done in a separate patch. I'll consider it.
>
>> 4. Additional goal: push down the aggregation to any relation (join/base)
>> where it can be computed.
>
> I think this can be achieved by adding extra aggregation nodes to the join
> tree. As I still anticipate more important design changes, this part is not at
> the top of my TODO list.

I guess, attempting this will reveal some more design changes
required; it may actually simplify the design.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-18 04:49:45 Re: jsonb_delete with arrays
Previous Message Michael Paquier 2017-01-18 04:26:46 Re: Patch to implement pg_current_logfile() function