Re: path toward faster partition pruning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: jesper(dot)pedersen(at)redhat(dot)com
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: path toward faster partition pruning
Date: 2017-09-27 05:03:20
Message-ID: 89e98d77-4a08-bd79-b26d-30186f831e20@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Jesper.

Firstly, thanks for looking at the patch.

On 2017/09/26 22:00, Jesper Pedersen wrote:
> Hi Amit,
>
> On 09/15/2017 04:50 AM, Amit Langote wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>>> I will post rebased patches later today, although I think the overall
>>> design of the patch on the planner side of things is not quite there yet.
>>> Of course, your and others' feedback is greatly welcome.
>>
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
>> the planner-portion of this patch based on what 0001 builds, I'm posting
>> it here.  It might actually turn out that we will review and commit
>> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
>> 0001 if you want to play with the later patches.  I would certainly like
>> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
>> to immediately (see below).
>>
>
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
>
> I'm looking at get_partitions_for_keys.

Sure. You may have noticed that min_datum_idx and max_datum_idx in
PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums
array, of datums that lie within the query-specified range (that is, using
=, >, >=, <, <= btree operators in the query). That array contains bounds
of all partitions sorted in the btree operator class defined order, at
least for list and range partitioning. I haven't (yet) closely looked at
the composition of hash partition datums in PartitionBoundInfo, which
perhaps have different ordering properties (or maybe none) than list and
range partitioning datums.

Now, since they are offsets of datums in PartitionBoundInfo, not indexes
of partitions themselves, their utility outside partition.c might be
questionable. But partition-wise join patch, for example, to determine if
two partitioned tables can be joined partition-wise, is going to check if
PartitionBoundInfos in RelOptInfos of two partitioned tables are
bound-to-bound equal [1]. Partition-pruning may select only a subset of
partitions of each of the joining partitioned tables. Equi-join
requirement for partition-wise join means that the subset of partitions
will be same on both sides of the join. My intent of having
min_datum_idx, max_datum_idx, along with contains_null_partition, and
contains_default_partition in PartitionAppendInfo is to have sort of a
cross check that those values end up being same on both sides of the join
after equi-join requirement has been satisfied. That is,
get_partitions_for_keys will have chosen the same set of partitions for
both partitioned tables and hence will have set the same values for those
fields.

As mentioned above, that may be enough for list and range partitioning,
but since hash partition datums do not appear to have the same properties
as list and range datums [2], we may need an additional field(s) to
describe the hash partition selected by get_partitions_for_keys. I guess
only one field will be enough, that will be the offset in the datums array
of the hash partition chosen for the query or -1 if query quals couldn't
conclusively determine one (maybe not all partition keys were specified to
be hashed or some or all used non-equality operator).

Hope that answers your question at least to some degree. Now, there are
some points Robert mentioned in his reply that I will need to also
consider in the patch, which I'll go do now. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com

[2] It appears that in the hash partitioning case, unlike list and range
partitioning, PartitionBoundInfo doesn't contain values that are
directly comparable to query-specified constants, but a pair (modulus,
remainder) for each partition. We'll first hash *all* the key values
(mentioned in the query) using the partitioning hash machinery and
then determine the hash partition index by using
(hash % largest_modulus) as offset into the PartitionBoundInfo.indexes
array.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2017-09-27 05:30:33 Re: v10 pg_ctl compatibility
Previous Message Tom Lane 2017-09-27 03:32:52 Re: Multicolumn hash indexes