Re: path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: path toward faster partition pruning
Date: 2017-10-03 04:12:40
Message-ID: ca4a73ac-4c3f-ed53-a0f7-f4f814a66d48@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David.

Thanks a lot for your review comments and sorry it took me a while to reply.

On 2017/09/28 18:16, David Rowley wrote:
> On 27 September 2017 at 14:22, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>> an author). I slightly tweaked his patch -- renamed the function
>> get_matching_clause to match_clauses_to_partkey, similar to
>> match_clauses_to_index.
>
> Hi Amit,
>
> I've made a pass over the 0001 patch while trying to get myself up to
> speed with the various developments that are going on in partitioning
> right now.
>
> These are just my thoughts from reading over the patch. I understand
> that there's quite a bit up in the air right now about how all this is
> going to work, but I have some thoughts about that too, which I
> wouldn't mind some feedback on as my line of thought may be off.
>
> Anyway, I will start with some small typos that I noticed while reading:
>
> partition.c:
>
> + * telling what kind of NullTest has been applies to the corresponding
>
> "applies" should be "applied"

Will fix.

> plancat.c:
>
> * might've occurred to satisfy the query. Rest of the fields are set in
>
> "Rest of the" should be "The remaining"

Will fix.

> Any onto more serious stuff:
>
> allpath.c:
>
> get_rel_partitions()
>
> I wonder if this function does not belong in partition.c. I'd have
> expected a function to exist per partition type, RANGE and LIST, I'd
> imagine should have their own function in partition.c to eliminate
> partitions
> which cannot possibly match, and return the remainder. I see people
> speaking of HASH partitioning, but we might one day also want
> something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
> routing records to be processed into foreign tables where you never
> need to query them again). It would be good if we could easily expand
> this list and not have to touch files all over the optimizer to do
> that. Of course, there would be other work to do in the executor to
> implement any new partitioning method too.

I think there will have to be at least some code in the optimizer. That
is, the code to match the query to the table's partition keys and using
the constant values therein to then look up the table's partitions. More
importantly, once the partitions (viz. their offsets in the table's
partition descriptor) have been looked up by partition.c, we must be able
to then map them to their planner data structures viz. their
AppendRelInfo, and subsequently RelOptInfo. This last part will have to
be in the optimizer. In fact, that was the role of get_rel_partitions in
the initial versions of the patch, when neither of the code for matching
keys and for pruning using constants was implemented.

One may argue that the first part, that is, matching clauses to match to
the partition key and subsequent lookup of partitions using constants
could both be implemented in partition.c, but it seems better to me to
keep at least the part of matching clauses to the partition keys within
the planner (just like matching clauses to indexes is done by the
planner). Looking up partitions using constants cannot be done outside
partition.c anyway, so that's something we have to implement there.

> I know there's mention of it somewhere about get_rel_partitions() not
> being so smart about WHERE partkey > 20 AND partkey > 10, the code
> does not understand what's more restrictive. I think you could
> probably follow the same rules here that are done in
> eval_const_expressions(). Over there I see that evaluate_function()
> will call anything that's not marked as volatile.

Hmm, unless I've missed it, I don't see in evaluate_function() anything
about determining which clause is more restrictive. AFAIK, such
determination depends on the btree operator class semantics (at least in
the most common cases where, say, ">" means greater than in a sense that
btree code uses it), so I was planning to handle it the way btree code
handles it in _bt_preprocess_keys(). In fact, the existing code in
predtest.c, which makes decisions of the similar vein also relies on btree
semantics. It's OK to do so, because partitioning methods that exist
today and for which we'd like to have such smarts use btree semantics to
partition the data. Also, we won't need to optimize such cases for hash
partitioning anyway.

> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const. From there you can likely ignore any
> other quals for that partition key, as if the query did contain
> another qual with partkey = SomeOtherConst, then that would have
> become a gating qual during the constant folding process. This way if
> the user had written WHERE partkey >= 1 AND partkey <= 1 the
> evaluation would end up the same as if they'd written WHERE partkey =
> 1 as the minimum and maximum would be the same value in both cases,
> and when those two values are the same then it would mean just one
> theoretical binary search on a partition range to find the correct
> partition instead of two.

Given the way the patch recognizes a given qual as contributing to the
equal key or minimum key or maximum key, it will not conclude the above to
in fact be an equal key, because that presumably would require comparing
clauses among each other to make such a discovery. I'm slightly hesitant
to add code to do that in the first version of the patch. That is, for
time being let WHERE partkey >= 1 and partkey <= 1 be handled by passing 1
as both minimum and maximum key, which requires two binary searches.
Whereas, WHERE partkey = 1 would require only one. Planner code to get
rid of the extra binary search lookup could come later, IMHO.

> I see in get_partitions_for_keys you've crafted the function to return
> something to identify which partitions need to be scanned. I think it
> would be nice to see a special element in the partition array for the
> NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
> would be defined constants somewhere. The signature of that function
> is not so pretty and that would likely tidy it up a bit. The matching
> partition indexes could be returned as a Bitmapset, yet, I don't
> really see any code which handles adding the NULL and DEFAULT
> partition in get_rel_partitions() either, maybe I've just not looked
> hard enough yet...

New version of the patch I will post soon cleans up the interface of
get_partitions_for_keys quite a bit; particularly the way selected
partitions are returned, for which I adopted an idea that Robert Haas
mentioned [2]. When it recognizes that a sequence of consecutive
partitions are to be scanned, it will return the starting and ending
offsets as *min_part_idx and *max_part_idx. Those that don't fit this
pattern (of which there should be only a few in many cases) are returned
in a Bitmapset, as a supposedly unordered set of partitioned. Since NULL
and DEFAULT partitions are partitions of this later category, they would
be included the bitmapset if it turns out that the query will need to scan
them after all.

Thanks again.

Regards,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf8269

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zeus Kronion 2017-10-03 04:15:56 Possible SSL improvements for a newcomer to tackle
Previous Message Tom Lane 2017-10-03 03:06:09 Re: Horrible CREATE DATABASE Performance in High Sierra