Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2018-07-19 19:04:22
Message-ID: CA+q6zcVizVudWJmy=XZ4uLA_5p9Rrr-W98K_SghtVQNBpertUA@mail.gmail.com
Lists: pgsql-hackers

> On Tue, 17 Jul 2018 at 11:58, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>
> On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>
> >> On 2018/06/27 22:21, Ashutosh Bapat wrote:
> >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
> >> >> Ah, okay. I thought of reporting this because I felt the errors may have
> >> >> to do with changes to the related code in HEAD between May 14, when you
> >> >> last posted the patches, and today that you may need to account for in
> >> >> your patches. For instance, there are many diffs like this:
> >> >>
> >> >> Looks like the Result node on top of Append is no longer there after
> >> >> applying your patch.
> >> >
> >> > Yes. They are coming because of a commit which removed the Result node
> >> > on top of an Append node. I don't remember exactly which one.
> >> >
> >> > I wouldn't worry about those diffs at this time. As I have mentioned
> >> > in earlier mails, the expected output from 0006 is quite large and is
> >> > not supposed to be committed. So, I don't see much value in fixing the
> >> > plans in that output.
> >> >
> >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
> >>
> >> I think not. I'll ignore 0006 for now and focus on other patches.
> >
> > Hi,
> >
> > Sorry for my irregular reviews. Unfortunately, the patch needs to be rebased
> > again.
>
> I rebased the patches without any problem on top of commit

Thanks!

> > In the meantime I have a few more comments about range_bounds_merge:
> >
> > * From what I see, partition_range_bound_cmp is executed on the same bounds a
> > few times, both to find whether they are overlapping and during the range
> > merging - is that necessary? Probably it would be enough to compare the
> > current ranges just once per iteration.
>
> The bounds that are compared in those cases are different. Any two
> bounds are compared only once per iteration. Remember each range has
> two bounds, thus there are four comparisons possible in total. In case
> of overlap, we do all four comparisons.

Yep, indeed, now I see.
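
To double-check my understanding, here is a minimal sketch of how I read that
scheme now, with a hypothetical simplified SimpleRange type and a hypothetical
overlap_and_merge() helper in place of the real PartitionRangeBound /
partition_range_bound_cmp machinery:

#include <stdbool.h>

/* Hypothetical simplified range; the real code works with
 * PartitionRangeBound and partition_range_bound_cmp. */
typedef struct SimpleRange
{
    int     lb;     /* lower bound, inclusive */
    int     ub;     /* upper bound, exclusive */
} SimpleRange;

/*
 * Compare the bounds of two ranges.  Each range has two bounds, so at
 * most four comparisons happen; all four are needed only when the
 * ranges overlap and the merged range has to be computed.
 */
static bool
overlap_and_merge(const SimpleRange *r1, const SimpleRange *r2,
                  SimpleRange *merged)
{
    /* Two comparisons are enough to rule an overlap out. */
    if (r1->ub <= r2->lb || r2->ub <= r1->lb)
        return false;

    /* The remaining two comparisons pick the merged bounds. */
    merged->lb = (r1->lb <= r2->lb) ? r1->lb : r2->lb;
    merged->ub = (r1->ub >= r2->ub) ? r1->ub : r2->ub;
    return true;
}

So two comparisons are enough to rule an overlap out, and only the overlapping
case needs all four.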

> > * Just to clarify - is iterating through all the partitions the best way of
> > finding matching ranges? Correct me if I'm wrong, but from what I see in the
> > comments for PartitionBoundInfoData, all the ranges are arranged in
> > increasing order - maybe then it's possible to use binary search for finding
> > a matching range?
>
> The loop in partition_range_bounds_merge() runs as many times as the
> maximum of the number of datums in the given partition bounds. So the
> complexity is O(n), where n is the maximum number of datums. With
> binary search the complexity would increase to O(n log n). I might be
> missing something here.

Now I'm confused even more. Correct me if I'm wrong, but here is what I see
right now:

* We're trying to solve the standard problem of finding overlapping intervals
from two sets of intervals.

* The current implementation implicitly compares every range from one side of a
join with every range from the other side, which is O(n^2).

Let's imagine we have two tables:

=# \d+ test1_overlap
                        Table "public.test1_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test11_overlap FOR VALUES FROM (200) TO (210),
            test12_overlap FOR VALUES FROM (220) TO (230),
            test13_overlap FOR VALUES FROM (240) TO (250)

=# \d+ test2_overlap
                        Table "public.test2_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test210_overlap FOR VALUES FROM (160) TO (170),
            test211_overlap FOR VALUES FROM (180) TO (190),
            test21_overlap FOR VALUES FROM (0) TO (10),
            test22_overlap FOR VALUES FROM (20) TO (30),
            test23_overlap FOR VALUES FROM (200) TO (210),
            test24_overlap FOR VALUES FROM (40) TO (50),
            test25_overlap FOR VALUES FROM (60) TO (70),
            test26_overlap FOR VALUES FROM (80) TO (90),
            test27_overlap FOR VALUES FROM (100) TO (110),
            test28_overlap FOR VALUES FROM (120) TO (130),
            test29_overlap FOR VALUES FROM (140) TO (150)

And the join:

select * from test1_overlap inner join test2_overlap using (id);

Partition-wise join will work fine, but what happens (I traced this through the
code under gdb) is that inside the function partition_range_bounds_merge we
start from two partitions:

test11_overlap (200, 210) and test21_overlap (0, 10)

In the comparison loop we figure out that there is no overlap and go to the
ub_cmpval > 0 branch, because test11_overlap is higher than test21_overlap.
Inside this branch we think that we need to handle a missing partition case
(apparently by mistake, but it works - if it's not a missing partition, there
aren't any records to join with from a default one). Since in this case there
isn't any default partition, the result is merged = true. After that
partition_range_get_next_lb_index moves us to another partition pair:

test11_overlap (200, 210) and test22_overlap (20, 30)

And so on and so forth until we reach the partition test23_overlap (200, 210),
which is actually what we're looking for. So basically, as I said above, we
will iterate over all the partitions, while we could skip some of them using
binary search.
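
Just to illustrate what I mean by skipping: since within one table the ranges
are sorted and do not overlap, the first candidate on the other side could be
located with a binary search over the upper bounds. A rough sketch, again with
a hypothetical simplified SimpleRange type and a hypothetical
first_candidate_partition() helper instead of the datum arrays kept in
PartitionBoundInfoData:

/* Hypothetical simplified range; the real code keeps datum arrays
 * inside PartitionBoundInfoData. */
typedef struct SimpleRange
{
    int     lb;     /* lower bound, inclusive */
    int     ub;     /* upper bound, exclusive */
} SimpleRange;

/*
 * Find the first partition that can possibly overlap "key", i.e. the
 * first one whose upper bound is greater than key->lb, using a binary
 * search over the sorted, non-overlapping ranges of one table.
 */
static int
first_candidate_partition(const SimpleRange *ranges, int nranges,
                          const SimpleRange *key)
{
    int     lo = 0;
    int     hi = nranges;       /* search within [lo, hi) */

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (ranges[mid].ub <= key->lb)
            lo = mid + 1;       /* ranges[mid] lies entirely below key */
        else
            hi = mid;
    }

    /* lo == nranges means all ranges lie below key, i.e. no overlap. */
    return lo;
}

In the example above, with test11_overlap (200, 210) as the key this jumps
straight to test23_overlap instead of visiting test21_overlap through
test29_overlap one by one.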

> >
> > * I'm not sure why partition-wise join is not applied in this case. Maybe I'm
> > wrong, but I was under the impression that it should work here:
> >
> > =# \d+ test1;
> >                             Table "public.test1"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test1_idx" btree (id)
> > Partitions: test11 FOR VALUES FROM (0) TO (100),
> >             test12 FOR VALUES FROM (100) TO (200),
> >             test13 FOR VALUES FROM (200) TO (300)
> >
> > =# \d+ test2;
> >                             Table "public.test2"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test2_idx" btree (id)
> > Partitions: test21 FOR VALUES FROM (10) TO (110),
> >             test22 FOR VALUES FROM (110) TO (210),
> >             test23 FOR VALUES FROM (210) TO (310)
> >
> >
> > I'm confused, since it looks like there is a one-to-one mapping of partitions.
>
> In this case, test21 overlaps test11 (10-100) and test12 (100-110), and
> test22 overlaps test12 (110-200) and test13 (200-210). So, there is no
> one-to-one mapping.

Yep, thanks for the explanation.
