Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-08-10 12:00:22
Message-ID: CAFjFpRfkr7igCGBBWH1PQ__W-XPy1O79Y-qxCmJc6FizLqFz7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 9, 2017 at 7:09 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> I started reviewing this. It's nicely commented, but it's also very
> complicated, and it's going to take me several rounds to understand
> what all the parts do, but here's some assorted feedback after reading
> some parts of the patches, some tinkering and quite a bit of time
> spent trying to break it (so far unsuccessfully).
>

Thanks for testing the patch. Good to know it has withstood your testing.

> On my computer it took ~1.5 seconds to plan a 1000 partition join,
> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
> 4000 partition join. I poked around in a profiler a bit and saw that
> for the 2000 partition case I spent almost half the time in
> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
> and about half of that was in bms_is_subset. The other half the time
> was in query_planner->make_one_rel which spent 2/3 of its time in
> set_rel_size->add_child_rel_equivalences->bms_overlap and the other
> 1/3 in standard_join_search.

Thanks for profiling.

I have separately mailed about bitmapset improvements.

Equivalence classes contain all the expressions which are known to be
equal in EquivalenceClass::ec_members. For a partitioned table, there
will be as many expressions as the number of children. The child
expressions are marked as em_is_child and are looked at only when
child relids are available to the function scanning the members. The
number of equivalence members increases linearly with the number of
partitions, and the number of words in the bitmaps increases linearly
with the number of partitions, effectively the the number of words
scanned increases quadratically. Hence the superlinear increase in
time with the number of partitions. When I took separate profiles with
1000, 2000 and 4000 partitions resp. I see that 15%, 29% and 40% time
spent in bms_is_subset() resp.

I am not sure how much we can do in this patchset to reduce this
problem. Apart from your bitmapset optimization, we could perhaps use
some more efficient data structure other than list to search members
based on the relids OR re-use parent's expressions for child somehow.
I have been thinking about the second option, but never got a chance
to work on it.

>
> When using list-based partitions, it must be possible to omit the part
> of a join key that is implied by the partition because the partition
> has only one list value. For example, if I create a two level
> hierarchy with one partition per US state and then time-based range
> partitions under that, the state part of this merge condition is
> redundant:
>
> Merge Cond: ((sales_wy_2017_10.state =
> purchases_wy_2017_10.state) AND (sales_wy_2017_10.created =
> purchases_wy_2017_10.created))

That's a good idea. In fact, we could use a similar trick when the
condition is sales_wy_2017_10.state = 'state'. We can not use the
trick in case of DML or when there are locking clauses, since we need
to evaluate the qual in case the row underneath changes while locking
it. We also can not do this when one of the keys being compared is a
nullable partition key (a concept explained in partition-wise join
implementation patch), since a partition can have also have rows with
NULL values for such partition keys in that partition.

I think the idea has merit, although, I think we should handle it
targetting more generic cases like the one stated above.

>
> 0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch
>
> -partition_bounds_equal(PartitionKey key,
> +partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
> PartitionBoundInfo b1,
> PartitionBoundInfo b2)
>
> I wonder is there any value in creating a struct to represent the
> common part of PartitionKey and PartitionScheme that functions like
> this and others need? Maybe not. Perhaps you didn't want to make
> PartitionKey contain a PartitionScheme because then you'd lose the
> property that every pointer to PartitionScheme in the system must be a
> pointer to an interned (canonical) PartitionScheme, so it's always
> safe to compare pointers to test for equality?

Right. Other reason to keep those two separate, is we might change the
contents of PartitionScheme as we move forward with the reviews. May
be we should revisit it after we have finalised the design.

>
> 0005-Canonical-partition-scheme.patch:
>
> +/*
> + * get_relation_partition_info
> + *
> + * Retrieves partitioning information for a given relation.
> + *
> + * For a partitioned table it sets partitioning scheme, partition key
> + * expressions, number of partitions and OIDs of partitions in the given
> + * RelOptInfo.
> + */
> +static void
> +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
> + Relation relation)
>
> Would this be better called "set_relation_partition_info"? It doesn't
> really "retrieve" the information, it "installs" it.

Yes. Done.

>
> +{
> + PartitionDesc part_desc;
> +
> + /* No partitioning information for an unpartitioned relation. */
> + if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
> + !(rel->part_scheme = find_partition_scheme(root, relation)))
> + return;
>
> Here and elsewhere you use the idiom !(foo = bar), which is perfectly
> good C in my book but I understand the project convention is to avoid
> implicit pointer->boolean conversion and to prefer expressions like
> (foo = bar) != NULL and there is certainly a lot more code like that.

PG code uses both the styles, search "if (!" in execExpr.c,
createplan.c for example.
I find this style useful, when I want to code, say "if this
relation does not have a partitioning scheme" rather than "if this
relation have NULL partitioning scheme".

>
> 0007-Partition-wise-join-implementation.patch
>
> + {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,
>
> This GUC should appear in postgresql.conf.sample.

Done.

Attached patches with the comments addressed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_dp_join_patches_v25.tar.gz application/x-gzip 148.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-08-10 12:00:44 Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Previous Message Sokolov Yura 2017-08-10 11:20:17 Re: Walsender timeouts and large transactions