Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2019-10-16 12:50:17
Message-ID: CAPmGK145V8DNCNQ2gQBgNE3QqoJGjsmK5WMwaA3FMirNM723KQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 25, 2019 at 12:59 AM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
> I will continue to review the rest of the patch.

I've been reviewing the rest of the patch. Here are my review comments:

* map_and_merge_partitions() checks whether the two partitions from
the outer and inner sides can be merged in two steps: 1) see if the
partitions are mapped to each other (ie, partmap1->from == index2 &&
partmap2->from == index1), and 2) see if the merged partition indexes
assigned are the same (ie, partmap1->to == partmap2->to) (or satisfy
some other conditions), but the first step seems redundant to me
because I think that if the merged partition indexes are the same,
then the partitions would be guaranteed to be mapped to each other.
Also, I noticed that that function can't handle some list-partitioning
cases properly. Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (0, 2, 3, 4);
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0001', '0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 2, 3, 4);
ANALYZE plt2;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c FROM plt1 t1 FULL JOIN plt2 t2 ON (t1.c
= t2.c) WHERE COALESCE(t1.a, 0) % 5 != 3 AND COALESCE(t1.a, 0) % 5 !=
4 ORDER BY t1.c, t1.a, t2.a;
QUERY PLAN

-------------------------------------------------------------------------------
------
Sort
Sort Key: t1.c, t1.a, t2.a
-> Hash Full Join
Hash Cond: (t1.c = t2.c)
Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a,
0) % 5) <> 4))
-> Append
-> Seq Scan on plt1_p1 t1
-> Seq Scan on plt1_p2 t1_1
-> Hash
-> Append
-> Seq Scan on plt2_p1 t2
-> Seq Scan on plt2_p2 t2_1
(12 rows)

This should use partitionwise join by the new partition-matching
algorithm but doesn't. The reason for that is because plt1_p1 and
plt2_p1 are mapped to different merged partitions and thus considered
not merge-able by map_and_merge_partitions() as-is. I might be
missing something, but I don't think this is intentional, so I rewrote
that function completely in the attached, which is a WIP patch created
on top of the patch [1].

* In handle_missing_partition(), I noticed this:

+ else if (missing_side_inner)
+ {
+ /*
+ * If this partition has already been mapped (say because we
+ * found an overlapping range earlier), we know where does it
+ * fit in the join result. Nothing to do in that case. Else
+ * create a new merged partition.
+ */
+ PartitionMap *extra_map = &maps_with_extra[extra_part];
+ if (extra_map->to < 0)
+ {
+ extra_map->to = *next_index;
+ *next_index = *next_index + 1;
+ *merged_index = extra_map->to;
+ }
+ }

As commented, that function skips setting *merged_index when the
"extra_part" partition is already mapped to a merged partition. This
would be correct for range partitioning, but not for list
partitioning, I think. Here is an example:

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0001', '0002');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0003', '0004');
INSERT INTO plt1 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i;
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0003', '0004');
INSERT INTO plt2 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (2, 3, 4);
ANALYZE plt2;
CREATE TABLE plt3 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt3_p1 PARTITION OF plt3 FOR VALUES IN ('0001');
CREATE TABLE plt3_p2 PARTITION OF plt3 FOR VALUES IN ('0003', '0004');
INSERT INTO plt3 SELECT i, i, to_char(i % 5, 'FM0000') FROM
generate_series(0, 24) i WHERE i % 5 IN (1, 3, 4);
ANALYZE plt3;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.a, t2.c, t3.a, t3.c FROM (plt1 t1 LEFT JOIN plt2
t2 ON (t1.c = t2.c)) FULL JOIN plt3 t3 ON (t1.c = t3.c) WHERE
COALESCE(t1.a, 0) % 5 != 3 AND coalesce(t1.a, 0) % 5 != 4 ORDER BY
t1.c, t1.a, t2.a, t3.a;
QUERY PLAN

-------------------------------------------------------------------------------
------
Sort
Sort Key: t1.c, t1.a, t2.a, t3.a
-> Hash Full Join
Hash Cond: (t1.c = t3.c)
Filter: (((COALESCE(t1.a, 0) % 5) <> 3) AND ((COALESCE(t1.a,
0) % 5) <> 4))
-> Hash Left Join
Hash Cond: (t1.c = t2.c)
-> Append
-> Seq Scan on plt1_p1 t1
-> Seq Scan on plt1_p2 t1_1
-> Hash
-> Append
-> Seq Scan on plt2_p1 t2
-> Seq Scan on plt2_p2 t2_1
-> Hash
-> Append
-> Seq Scan on plt3_p1 t3
-> Seq Scan on plt3_p2 t3_1
(18 rows)

I think this should use 3-way partitionwise join by the new algorithm
but doesn't. The reason for that is because when considering
partitionwise join for plt1 and plt2, partition_list_bounds_merge()
would incorrectly produce {'0000', '0002'} as the merged values for
the join segment of plt1_p1 and plt2_p1, not {'0000', '0001', '0002'},
because that function would call handle_missing_partition() for the
values '0000' and '0001' of plt1_p1, and it would set *merged_index
correctly for '0000' but not for '0001' (keeping *merged_index=-1),
due to the behavior mentioned above, resulting in the merged values
for the join segment {'0000', '0002'}, not {'0000', '0001', '0002'}.
This would cause to fail to merge the values for the join segment
{'0000', '0002'} and the values for plt3_p1 {'0001'} when considering
partitionwise join for the 2-way join and plt3. I fixed this as well
in the attached. handle_missing_partition() can handle both cases
where a datum is missing from the inner side and where a datum is
missing from the outer side in a unified way, but I think that that
makes the code pretty hard to read. Also, I think
handle_missing_partition() is not caller-friendly because the caller
needs to write something like this:

+ /* A range missing from the inner side. */
+ bool missing_side_outer;
+ bool missing_side_inner;

+ /*
+ * For a FULL join, inner relation acts as both OUTER and INNER
+ * relation. For LEFT and ANTI join the inner relation acts as
+ * INNER relation. For INNER and SEMI join OUTER and INNER
+ * differentiation is immaterial.
+ */
+ missing_side_inner = (jointype == JOIN_FULL ||
+ jointype == JOIN_LEFT ||
+ jointype == JOIN_ANTI);
+ missing_side_outer = (jointype == JOIN_FULL);
+ if (!handle_missing_partition(inner_maps,
+ outer_maps,
+ inner_default,
+ outer_part,
+ missing_side_outer,
+ missing_side_inner,
+ &next_index,
+ &default_index,
+ &merged_index))
+ return NULL;

So I'd like to propose to introduce separate functions like
process_outer_partition() and process_inner_partition() in the
attached, instead of handle_missing_partition(). (I added a fast path
to these functions that if both outer/inner sides have the default
partitions, give up on merging partitions. Also, I modified the
caller sites of proposed functions so that they call these if
necessary.)

Other minor changes in the attached:

* I modified partition_range_bounds_merge() and
partition_list_bounds_merge() so that the order of arguments for them
matches partition_bounds_merge(). Also, I did some cleanup for these
functions, and removed this in these function because this merely
re-checks the while condition above.

+ /* If we exhausted both the sides, we won't enter the loop. */
+ Assert(!finished_inner || !finished_outer);

* I modified merge_null_partitions() and merge_default_partitions()
accordingly to the changes described above. Also, I did some cleanup
and modified the caller sites of these functions so that they call
these functions if necessary.

* I also modified generate_matching_part_pairs() accordingly.

* I added a creanup to free memory allocated by init_partition_map().

* We have this in a few places such as merge_default_partitions():

+ extra_map->to = *next_index;
+ *next_index = *next_index + 1;
+ *merged_index = extra_map->to;

To reduce code duplication, I made this into a separate function.

* I think the return statement in the end of partition_range_merge()
is useless, so I removed it.

That is all for now. Will continue to review.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com

Attachment Content-Type Size
modify-partbounds-changes-1.patch application/octet-stream 67.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2019-10-16 12:53:56 Re: v12.0: segfault in reindex CONCURRENTLY
Previous Message Tom Lane 2019-10-16 12:32:48 Re: configure fails for perl check on CentOS8