Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2020-04-06 11:43:15
Message-ID: CAG-ACPVfhe9TRjPVx95BQrY5svbw07ZCbgy7u7uy_=skHiKTOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 3 Apr 2020 at 20:45, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:

> Hi,
>
> On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat
> <ashutosh(dot)bapat(at)2ndquadrant(dot)com> wrote:
> > On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
> >> I've started reviewing the patch a couple days ago. I haven't done any
> >> extensive testing, but I do have a bunch of initial comments that I can
> >> share now.
> >>
> >> 1) I wonder if this needs to update src/backend/optimizer/README, which
> >> does have a section about partitionwise joins. It seems formulated in a
> >> way that that probably covers even this more advanced algorithm, but
> >> maybe it should mention handling of default partitions etc.?
>
> > Done. Please check the wording. It might need some word smithy.
>
> You heavily changed the existing documentation about PWJ, but I don't
> think we really need to do so. Also, IMO I think the description
> about default partitions you added is needed in README. I think it
> would be better to put such a description in source files. How about
> something like the attached, instead? I wrote part of this based on
> the commit message in the original versions of the patch you posted.
>

I corrected some grammar, typos. Broke longer sentences into smaller ones
so that its easy to read and understand. As is the concept is hard to
understand with all its limitations. Thanks for the example. Retained it.

You seem to have removed few comments that explained the algorithm in
detail from build_joinrel_partition_info(). It would have been good to have
those there. But I am ok not having them either.

But it will be good to have following addition I suggested in my patches to
make sure nparts is set to 0 for an unpartitioned relation as per the
comment on nparts in RelOptInfo.
@@ -1653,6 +1663,8 @@ build_joinrel_partition_info(RelOptInfo *joinrel,
RelOptInfo *outer_rel,
jointype, restrictlist))
{
Assert(!IS_PARTITIONED_REL(joinrel));
+ /* Join is not partitioned. */
+ joinrel->nparts = 0;
return;
}

> >> There certainly needs to be some description of the algorithm somewhere,
> >> either in a README or before a suitable function. It doesn't have to be
> >> particularly detailed, a rough outline of the matching would be enough,
> >> so that readers don't have to rebuild the knowledge from pieces
> >> scattered around various comments.
>
> > The algorithm for list and range partitioned tables is slightly
> different. So, I have added separate prologue to each list_merge_bounds()
> and range_merge_bounds(). Please check if that serves the purpose.
>
> Too detailed to me. In this:
>
> + * If there are multiple partitions from one side matching a given
> partition on
> + * the other side, the algorithm bails out since we do not have machinary
> for
> + * joining one partition with mulitple partitions. It might happen that
> any of
> + * the list items of a partition from the outer relation do not appear in
> the
> + * inner relation and there is no default partition in the inner
> relation. Such
> + * a partition from the outer side will have no matching partition on the
> inner
> + * side. The algorithm will bail out in such a case since we do not have a
> + * mechanism to perform a join with a non-existing relation.
>
> I don't think the last comment is correct; that would apply to the old
> versions of this function IIRC, but not to the latest version. How
> about something much simpler like the attached, instead?
>

I know that algorithm pretty well by now, so it suffices for me to say we
use something similar to merge join, but may be for someone without that
background a detailed explanation is useful. But this looks fine at the
moment.

> > Done. Actually this wasn't updated when partition pruning was
> introduced, which could cause a partitionwise join to be not used even when
> those conditions were met. Similarly when a query involved whole row
> reference. It's hard to explain all the conditions under which
> partitionwise join technique will be used. But I have tried to keep it easy
> to understand.
>
> IMO I think your words "there is exactly one pair of matching
> partitions." is a bit misleading, because that sounds like that PWJ
> doesn't allow multiply-segmented join. How about s/exact
> matching/one-to-one matching/ in the existing documentation, instead?

Good catch. That was really misleading. Looks good to me.

> >> 3) I think the for nparts comment is somewhat misleading:
> >>
> >> int nparts; /* number of partitions; 0 = not partitioned;
> >> * -1 = not yet set */
> >>
> >> which says that nparts=0 means not partitioned. But then there are
> >> conditions like this:
> >>
> >> /* Nothing to do, if the join relation is not partitioned. */
> >> if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
> >> return;
> >>
> >> which (ignoring the obsolete comment) seems to say we can have nparts==0
> >> even for partitioned tables, no?
>
> Yeah, I think I was a bit hasty. I fixed this.
>

For a non-join relation, nparts = 0 and nparts = -1 both have the same
meaning. Although we never set nparts = 0 for a non-join relation?
Otherwise, the comment looks good now.

--
Best Wishes,
Ashutosh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2020-04-06 11:45:34 Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Previous Message David Rowley 2020-04-06 11:39:03 Re: Make MemoryContextMemAllocated() more precise