Re: [HACKERS] Runtime Partition Pruning

From: Beena Emerson <memissemerson(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: [HACKERS] Runtime Partition Pruning
Date: 2017-12-19 08:44:57
Message-ID: CAOG9ApHBXAqAbaHApTU0c+d7qkG=rD6kPZWXZ2D67wYzMk5Uew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David,

Thank you for reviewing and looking at this.

I have attached the WIP patch which incorporates some of Robert's
comments and is rebased over Amit's v14 patch.

On Sat, Dec 16, 2017 at 11:35 AM, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 13 December 2017 at 00:33, Beena Emerson <memissemerson(at)gmail(dot)com> wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
>
> Hi Beena,
>
> Thanks for posting an updated patch.
>
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.
>
> This quick example shows that the partition selection is not correct.
>
> create table p (a int, b int) partition by range (a);
>
> create table p_a_neg partition of p for values from (minvalue) to (0)
> partition by range (b);
> create table p_a_pos partition of p for values from (0) to (maxvalue)
> partition by range (b);
>
> create table p_a_neg_b_neg partition of p_a_neg for values from
> (minvalue) to (0);
> create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
> (maxvalue);
>
> create table p_a_pos_b_neg partition of p_a_pos for values from
> (minvalue) to (0);
> create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
> (maxvalue);
>
> prepare q1 (int, int) as select * from p where a = $1 and b = $1;
>
> explain analyze execute q1 (-1,-1); -- this works.
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
> Append (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
> rows=0 loops=1)
> Runtime Partition Pruning: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_neg_b_neg (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.023..0.023 rows=0 loops=1)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_neg_b_pos (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_pos_b_neg (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_pos_b_pos (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
> Append (cost=0.00..175.60 rows=4 width=8) (actual
> time=758996.359..758996.359 rows=0 loops=1)
> Runtime Partition Pruning: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_neg_b_neg (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.056..0.056 rows=0 loops=1)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_neg_b_pos (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_pos_b_neg (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> -> Seq Scan on p_a_pos_b_pos (cost=0.00..43.90 rows=1 width=8)
> (never executed)
> Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> So, I started to look at what the best way to put this right might be.
> I see that since Parallel Append was committed that the subnodes are
> now sorted in cost order inside create_append_path(), so likely we'll
> want to delay figuring out the subpath list indexes until after that's
> done since sorting would scramble our index arrays. We could simply
> look at the subpaths at the end of create_append_path() and create
> some sort of new matrix type that can accept the output of Amit's
> get_partitions_from_clauses() and translate that Bitmapset into the
> subpath indexes (another Bitmapset). This will also need to work for
> sub-partitions too, so this matrix must be some sort of tree that we
> can descend into when we see that get_partitions_from_clauses returned
> a bit for a sub-partition instead of a leaf-partition.

Yes, the change in sort order means that the current
append_paths_array cannot be used for Parallel append and a new logic
has to be devised. I have still not thought about it but your method
seems like a good way to go. Currently I have worked on the Parallel
bit considering that the appends_path_array holds the correct
subplan_index.

>
> I bashed this idea around a bit and I came up with the attached. It's
> very far from complete and in a very WIP state. I've not really done
> anything to make the correct clause list available in nodeAppend.c
> yet, but I think the code that's there is worthy of a look. I've not
> done that much work on the new choose_next_subplan* functions in
> nodeAppend.c. I just modified choose_next_subplan_locally to show how
> this set of functions need to take into account the subnode bitmap set
> of valid partitions to scan. Perhaps some special case is needed to
> have these functions ignore the Bitmapset when runtime pruning is
> disabled (perhaps a completely new set of the functions is needed to
> support the selection of the next non-pruned partition). Although,
> probably that can be debated a bit later as it's a fairly minor detail
> for now.
>
> My patch also lacks any means to extract the Params during
> match_clauses_to_partkey(), or at least most of the cases. I've just
> added 1 case there. I did this because I thought it was better to
> extract the ParamIds rather than a bool to say we've matched params.
> This way we can only reevaluate which subplans to look at on rescan of
> an Append if and only if the params we actually care about have
> changed. I've not given this part a huge amount of thought yet.

Currently, during ReScan the patch prunes whenever the Param changes
and I had this 'rescan pruning optimization' in mind but had not
worked on it.

>
> I'm a little unsure where to go from here. Obviously, this is quite a
> major rewrite of your patch. The parts that I've got missing likely
> can use quite a bit of the stuff you've already written, but that
> needs some review. I wanted to post this now as I know you're busy
> working on this to rebase it on parallel Append and to also address
> Robert's concerns, which all seem valid to me.
>
> What's the best way for us to coordinate our efforts on this? Maybe
> you could look at my patch and sanity check it to ensure I'm not
> taking anything in the wrong direction?

I have not seen the patch in depth but this approach seems good. Dilip
has already worked on the join equality bug you pointed out before, I
am yet to merge that patch and will work on the optimizer comments of
Robert's. Maybe I can incorporate your patch while working on it.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Beena Emerson 2017-12-19 08:54:26 Re: [HACKERS] Runtime Partition Pruning
Previous Message Nikhil Sontakke 2017-12-19 08:37:41 Re: [HACKERS] logical decoding of two-phase transactions