Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: amul sul <sulamul(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2019-08-16 13:25:10
Message-ID: CAPmGK16dJcMJhVQ0pq3KGA_L+OuMTOcynMu-7jMsK06uZkCnqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 30, 2019 at 6:00 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
> On Fri, Jul 19, 2019 at 10:44 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Thu, Jul 18, 2019 at 2:55 AM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
> > > I.e., partition_bounds_merge() is performed for each pair of input
> > > partitioned relations for a join relation in try_partitionwise_join().
> > > Since partition_bounds_merge() would need a lot of CPU cycles, I don't
> > > think this is acceptable; ISTM that some redesign is needed to avoid
> > > this. I'm wondering that once we successfully merged partition bounds
> > > from a pair of input partitioned relations for the join relation, by
> > > using the merged partition bounds, we could get the lists of matching
> > > to-be-joined partitions for subsequent pairs of input partitioned
> > > relations for the join relation in a more efficient way than by
> > > performing partition_bounds_merge() as proposed in the patch.
> >
> > I don't know whether partition_bounds_merge() is well-implemented; I
> > haven't looked.
>
> My concern about that is list partitioning. In that case that
> function calls partition_list_bounds_merge(), which generates the
> partition bounds for a join relation between two given input
> relations, by performing merge join for a pair of the datums arrays
> from both the input relations. Since the datums arrays contain all
> non-null list values across all partitions, if the numbers of the list
> values (ie, ndatums') are large, the merge join would require not a
> few cycles, so it would be much expensive to perform the merge join
> for each such pair when considering large N-way partitionwise joins of
> list-partitioned tables with large ndatums. To see that, I did simple
> tests using a list-partitioned table pt created with the attached,
> which has 10 partitions, each with 1000 list values, so ndatums is
> 10000. (The tests below are performed with
> enable_partitionwise_join=on.)
>
> * 2-way self-join of pt: explain analyze select * from pt t0, pt t1
> where t0.a = t1.a;
> - HEAD:
> Planning Time: 1.731 ms
> Execution Time: 15.159 ms
> - Patched:
> Planning Time: 1.884 ms
> Execution Time: 15.127 ms
>
> * 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3 where t0.a = t1.a and t1.a = t2.a and t2.a = t3.a;
> - HEAD:
> Planning Time: 28.787 ms
> Execution Time: 34.313 ms
> - Patched:
> Planning Time: 40.263 ms
> Execution Time: 35.019 ms
>
> * 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
> pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
> t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
> and t6.a = t7.a;
> - HEAD:
> Planning Time: 2279.653 ms
> Execution Time: 63.303 ms
> - Patched:
> Planning Time: 3834.751 ms
> Execution Time: 62.949 ms
>
> Actually, these joins would not need the partition-matching algorithm
> the patch adds; we could probably avoid this regression by modifying
> the patch to plan these joins the same way as before, but ISTM that
> these results imply that the cost of performing the merge join for
> each such pair would not be negligible when considering large N-way
> partitionwise joins mentioned above. Maybe I'm missing something,
> though.
>
> > But in general I don't see an alternative to doing
> > some kind of merging on each pair of input relations. That's just how
> > planning works, and I don't see why it should need to be prohibitively
> > expensive. I might be missing something, though; do you have an idea?
>
> Yes, I do; but I think I should think a little more about that.

I gave more thought to this. My idea is to generate the list of
matching partitions to be joined from the partbound info after
creating it with partition_bounds_merge(), as I stated before. Let me
explain using an example. Suppose that R, S and T are partitioned
tables, that R=R(a,b,c) is partitioned on ranges of a into three
partitions R1, R2 and R3, that S=S(a,b,c) is partitioned on ranges of
a into three partitions S1, S2 and S3, and that T=T(a,b,c) is
partitioned on ranges of a into three partitions T1, T2 and T3.
Consider a 3-way join query: SELECT * FROM R, S, T WHERE R.a=S.a AND
S.a=T.a; Suppose that when considering 2-way join R IJ S,
partition_bounds_merge() successfully merges the partition bounds for
R and S, and generates join pairs (R1, S1), (R2, S2) and (R3, S3), and
that when considering 3-way join (R IJ S) IJ T, that function
successfully merges the partition bounds for (R IJ S) and T, and
generates join pairs ((R1 IJ S1), T1) and ((R2 IJ S2), T2). The
question here is: do we really need to perform
partition_bounds_merge() to generate the list of matching partitions
to be joined for 3-way join R IJ (S IJ T), for example? I don't think
so; because 1) we see from the 3-way join pairs ((R1 IJ S1), T1) and
((R2 IJ S2), T2) that Ri, Si and Ti (i=1,2) have boundaries
overlapping with each other, which means that there would be (S1, T1)
and (S2, T2) as 2-way join pairs for S IJ T, and 2) we have R IJ (S IJ
T) = (R IJ S) IJ T = (R1 IJ S1) IJ T1 union (R2 IJ S2) IJ T2 = R1 IJ
(S1 IJ T1) union R2 IJ (S2 IJ T2), which means that the list of
matching partitions to be joined for R IJ (S IJ T) in question are
(R1, (S1 IJ T1)) and (R2, (S2 IJ T2)) since we see from the equation
that pairs from R and (S IJ T) other than that would not contribute to
the join result. Does that make sense? Attached is a WIP patch
implementing this created on top of the patch [1].

It seems that I performed the above tests on an assertion-enabled
build. :( So I executed the tests one more time. Here are the
results.

* 2-way self-join of pt: explain analyze select * from pt t0, pt t1
where t0.a = t1.a;
- HEAD:
Planning Time: 0.969 ms
Execution Time: 13.843 ms
- with patch:
Planning Time: 1.720 ms
Execution Time: 14.393 ms
- with patch plus attached:
Planning Time: 1.630 ms
Execution Time: 14.002 ms

* 4-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3 where t0.a = t1.a

and t1.a = t2.a and t2.a = t3.a;
- HEAD:
Planning Time: 12.203 ms
Execution Time: 31.784 ms
- with patch:
Planning Time: 32.102 ms
Execution Time: 32.504 ms
- with patch plus attached:
Planning Time: 19.471 ms
Execution Time: 32.582 ms

* 8-way self-join of pt: explain analyze select * from pt t0, pt t1,
pt t2, pt t3, pt t4, pt t5, pt t6, pt t7 where t0.a = t1.a and t1.a =
t2.a and t2.a = t3.a and t3.a = t4.a and t4.a = t5.a and t5.a = t6.a
and t6.a = t7.a;
- HEAD:
Planning Time: 948.131 ms
Execution Time: 55.645 ms
- with patch:
Planning Time: 2939.813 ms
Execution Time: 56.760 ms
- with patch plus attached:
Planning Time: 1108.076 ms
Execution Time: 55.750 ms

Note: the attached patch still uses the proposed partition matching
algorithm for these queries. As I said before, these queries don't
need that algorithm, so we could eliminate the planning overhead
compared to HEAD, by planning these queries as before, perhaps, but I
haven't modified the patch as such yet.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK177W%2BjNgpM5f_m-vdDKbEBu_%3D3CyPzFjkT_1nzf1Vqn%2BA%40mail.gmail.com

Attachment Content-Type Size
Modify-partition-matching-algorithm-WIP.patch application/octet-stream 17.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2019-08-16 13:30:58 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Previous Message Peter Eisentraut 2019-08-16 13:09:57 Re: Unexpected "shared memory block is still in use"