Re: path toward faster partition pruning

From: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: path toward faster partition pruning
Date: 2017-09-26 14:57:15
Message-ID: 08062340-3e9a-302a-8b63-c4c27aa8196b@redhat.com
Lists: pgsql-hackers

On 09/26/2017 10:33 AM, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
> <jesper(dot)pedersen(at)redhat(dot)com> wrote:
>> Could you share your thoughts on the usage of PartitionAppendInfo's
>> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
>
> This brings up something that I've kind of been thinking about. There
> are sort of four cases when it comes to partition pruning:
>
> 1. There is exactly one matching partition. For example, this happens
> when there is an equality constraint on every partition column.
>
> 2. There are multiple matching partitions which are consecutive. For
> example, there is a single level of range partitioning with no default
> partition and the single partitioning column is constrained by
> <, >, <=, or >=.
>
> 3. There are multiple matching partitions which are not consecutive.
> This case is probably rare, but it can happen if there is a default
> partition, if there are list partitions with multiple bounds that are
> interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
> and the query allows values >= 4 and <= 5), if the query involves OR
> conditions, or if there are multiple levels of partitioning (e.g.
> partition by a, subpartition by b, put a range constraint on a and an
> equality constraint on b).
>
> 4. There are no matching partitions.
>
> One of the goals of this algorithm is to be fast. The obvious way to
> cater to case (3) is to iterate through all partitions and test
> whether each one works, returning a Bitmapset, but that is O(n).
> Admittedly, it might be O(n) with a pretty small constant factor, but
> it still seems like exactly the sort of thing that we want to avoid
> given the desire to scale to higher partition counts.
>
> I propose that we create a structure that looks like this:
>
> struct foo {
>     int        min_partition;
>     int        max_partition;
>     Bitmapset *extra_partitions;
> };
>
> This indicates that all partitions from min_partition to max_partition
> need to be scanned, and in addition any partitions in extra_partitions
> need to be scanned. Assuming that we only consider cases where all
> partition keys or a leading subset of the partition keys are
> constrained, we'll generally be able to get by with just setting
> min_partition and max_partition, but extra_partitions can be used to
> handle default partitions and interleaved list bounds. For equality
> on all partitioning columns, we can do a single bsearch of the bounds
> to identify the target partition at a given partitioning level, and
> the same thing works for a single range-bound. If there are two
> range-bounds (< and > or <= and >= or whatever) we need to bsearch
> twice. The default partition, if any and if matched, must also be
> included. When there are multiple levels of partitioning things get a
> bit more complex -- if someone wants to knock out a partition that
> breaks up the range, we might need to shrink the main range to cover
> part of it and kick the other indexes out to extra_partitions.
>
> But the good thing is that in common cases with only O(lg n) effort we
> can return O(1) data that describes what will be scanned. In cases
> where that's not practical we expend more effort but still prune with
> maximal effectiveness.
>

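To restate the proposal concretely, here is how I read the four cases
mapping onto that structure (a sketch only -- the values are made up,
and treating min > max as "no partitions" in case 4 is my assumption,
not something spelled out above):

#include "nodes/bitmapset.h"

/* The structure proposed above, restated for reference. */
struct foo
{
    int        min_partition;     /* first partition index to scan */
    int        max_partition;     /* last partition index to scan */
    Bitmapset *extra_partitions;  /* stragglers outside that run */
};

/* Case 1: equality on every partition column -> one partition. */
struct foo one  = { 3, 3, NULL };

/* Case 2: a consecutive run, e.g. one range-bounded column -> 2..5. */
struct foo run  = { 2, 5, NULL };

/* Case 3: a consecutive core plus stragglers (default partition,
 * interleaved list bounds, ORs, multi-level keys); extra_partitions
 * would be filled at runtime, e.g. with bms_make_singleton(7). */
struct foo gaps = { 2, 4, NULL };

/* Case 4: no matching partitions; min > max could encode "empty". */
struct foo none = { 1, 0, NULL };
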
For OLTP-style applications, case 1) would be the common one, and with
hash partitions it would arise from a single equality constraint.
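
As a sketch of that path (the helper below is hypothetical, not taken
from the hash partition patch, and it assumes the key's hash value has
already been computed; struct foo is as in the sketch above):

#include <stdint.h>

/* Hypothetical: hash partitioning plus one equality constraint picks
 * exactly one partition, i.e. case 1 above. */
static void
prune_hash_equality(uint64_t hash_value, int nparts, struct foo *result)
{
    int idx = (int) (hash_value % (uint64_t) nparts);

    result->min_partition = idx;      /* single partition: min == max */
    result->max_partition = idx;
    result->extra_partitions = NULL;  /* no stragglers */
}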

So, changing the method signature to use a data type like the one you
describe above, instead of the explicit min_datum_idx / max_datum_idx
output parameters, would be clearer.
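
For illustration, the shape I have in mind (the function and type
names below are placeholders, not the actual patch API; struct foo is
as in the sketch above):

/* Opaque stand-in for whatever carries the partition-key clauses. */
typedef struct PartitionKeyClauses PartitionKeyClauses;

/* Instead of int *min_datum_idx / int *max_datum_idx outputs, which
 * cannot express case 3, return the whole description of what must
 * be scanned. */
extern void get_partitions_for_keys(PartitionKeyClauses *keys,
                                    struct foo *result);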

One could advocate (*cough*) that the hash partition patch [1] should be
merged first, in order to find other instances where CommitFest entries
don't currently account for hash partitions in their method signatures;
Beena noted something similar in [2]. I know that you have said
otherwise [3], but this is CommitFest 1, so there is time for a revert
later, and hash partitions are already useful for internal testing.

[1] https://commitfest.postgresql.org/14/1059/
[2] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com
[3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html

Best regards,
Jesper
