Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2018-08-23 10:31:47
Message-ID: CA+q6zcWRaHhW7ZDrdJ0UHKJ_xpSyMEjE9f0F_t=vOq-AzcvvvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Fri, 27 Jul 2018 at 20:13, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Jul 27, 2018 at 3:17 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> > Apart from the complexity there's also a possibility that this
> > skipping will reduce the efficiency actually in normal cases. Consider
> > a case where A and B have exactly matching partitions. Current
> > partition matching algorithm compare any given range/bound only once
> > and we will have O(n) algorithm. If we use a binary search, however,
> > for every range comparison, it will be O(n log n) algorithm. There
> > will be unnecessary comparisons during binary search. The current
> > algorithm is always O(n), whereas binary search would be O(n log(n))
> > with a possibility of having sub-O(n) complexity in some cases. I
> > would go for an algorithm which has a consistent time complexity
> > always and which is efficient in normal cases, rather than worrying
> > about some cases which are not practical.
>
> Yeah, I think that's a good point. The normal case here will be that
> the partition bounds are equal, or that there are a few extra
> partitions on one side that don't exist on the other. We don't want
> other cases to be crazily inefficient, but I suspect in practice that
> if the partitioning bounds aren't pretty close to matching up exactly,
> we're going to end up failing to be able to do a partition-wise join
> at all. It's not very likely that somebody happens to have a case
> where they've partitioned two tables in randomly different ways, but
> then they decide to join them anyway, but then it turns out that the
> partition bounds happen to be compatible enough that this algorithm
> works.

> On Mon, 23 Jul 2018 at 10:38, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> 1. those cases are rare enough that we can ignore those right now. How
> many times we would encounter the case you have quoted, for example?
> Usually the ranges will be continuous only differing in the first or
> last partition e.g time-series data.

Ok, but what about list partitioning? I believe the area of application for it
can be more diverse than mostly just for time-series, and in the patch almost
the same approach is used to merge list partitions.

Other than that everything seems fine from functionality point of view, and so
far I couldn't find any situation that produces a wrong plan. Some of the joins
were not that intuitive from the first glance, but at the end everything was
according to the documentation.
Taking this into account I wonder if it's possible somehow to give any hints in
an explain result about why it wasn't possible to apply partition wise join? If
something wasn't clear for me I ended up looking at the value of "merged" flag,
and to figure out actual reason one needs to trace the entire algorithm.

Besides that I checked the performance in some simple cases, no problems on
this side (but I'll also do some checks for more complicated joins).

And I still have some complaints about readability, although I can imagine that
it's just me:

* Many functions carry some unrelated arguments just to pass them through,
which obscures the purpose of a function.

* I want to suggest to introduce a new data structure for partitions mapping
instead of just keeping them in arrays (was it discussed already before?).

* What is the reason that almost everywhere in the patch there is a naming
convention "outer/inner" partition maps, and sometimes (e.g. in
generate_matching_part_pairs) it's "map1/map2"?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-08-23 10:58:26 Make executor's Range Table an array instead of a List
Previous Message Alexander Kukushkin 2018-08-23 10:06:59 Re: BUG #15346: Replica fails to start after the crash