Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-22 09:15:38
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Updated patches attached. I merged what used to be 0006 and 0007 into one.

On 2016/11/19 2:23, Robert Haas wrote:
> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Oh but wait, that means I can insert rows with NULLs in the range
>> partition key if I choose to insert it directly into the partition,
>> whereas I have been thinking all this while that there could never be
>> NULLs in the partition key of a range partition. What's more,
>> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
>> every partition key column in case of a range partition. Is that
>> wrongheaded altogether? (also see my reply to your earlier message about
>> NULLs in the range partition key)
> The easiest thing to do might be to just enforce that all of the
> partition key columns have to be not-null when the range-partitioned
> table is defined, and reject any attempt to DROP NOT NULL on them
> later. That's probably better that shoehorning it into the table
> constraint.

Agreed that forcing range partitioning columns to be NOT NULL during table
creation would be a better approach. But then we would have to reject
using expressions in the range partition key, right?

>> Thanks for the idea below!
>>> 1. Forget the idea of a tree. Instead, let the total number of tables
>>> in the partitioning hierarchy be N and let the number of those that

[ ... ]

>>> No recursion, minimal pointer chasing, no linked lists. The whole
>>> thing is basically trivial aside from the cost of
>>> list/range_partition_for_tuple itself; optimizing that is a different
>>> project. I might have some details slightly off here, but hopefully
>>> you can see what I'm going for: you want to keep the computation that
>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>> instead set things up in advance so that getting the partition for a
>>> tuple is FAST. And you want the data structures that you are using in
>>> that process to be very compact, hence arrays instead of linked lists.
>> This sounds *much* better. Here is a quick attempt at coding the design
>> you have outlined above in the attached latest set of patches.
> That shrank both 0006 and 0007 substantially, and it should be faster,
> too. I bet you can shrink them further:

Some changes described below have reduced the size to a certain degree.

> - Why is PartitionKeyExecInfo a separate structure and why does it
> have a NodeTag? I bet you can dump the node tag, merge it into
> PartitionDispatch, and save some more code and some more
> pointer-chasing.

OK, I merged the fields of what used to be PartitionKeyExecInfo into
PartitionDispatchData as the latter's new fields key and keystate.

> - I still think it's a seriously bad idea for list partitioning and
> range partitioning to need different code-paths all over the place
> here. List partitions support nulls but not multi-column partitioning
> keys and range partitions support multi-column partitioning keys but
> not nulls, but you could use an internal structure that supports both.
> Then you wouldn't need partition_list_values_bsearch and also
> partition_rbound_bsearch; you could have one kind of bound structure
> that can be bsearch'd for either list or range. You might even be
> able to unify list_partition_for_tuple and range_partition_for_tuple
> although that looks a little harder. In either case, you bsearch for
> the greatest value <= the value you have. The only difference is that
> for list partitioning, you have to enforce at the end that it is an
> equal value, whereas for range partitioning less-than-or-equal-to is
> enough. But you should still be able to arrange for more code
> sharing.

I have considered these suggestions in the latest patch. Now instead of
PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
there is only one PartitionBoundInfo which consolidates the partition
bound information of a partitioned table. Some of the fields are
applicable only to one of list or range case; for example, null-accepting
list partition index, infinite status of individual range datums.

Also, there is now only one binary search function named
partition_bound_bsearch() which invokes a comparison function named
partition_bound_cmp(). The former searches a probe (a partition bound or
tuple) within a PartitionBoundInfo, which is passed all the way down to
the comparison function.

Also, we no longer have list_partition_for_tuple() and
range_partition_for_tuple(). Instead, in get_partition_for_tuple()
itself, there is a bsearch followed by list and range partitioning
specific steps based on the returned offset.

> - I don't see why you need the bound->lower stuff any more. If
> rangeinfo.bounds[offset] is a lower bound for a partition, then
> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
> partition and the partition is followed by a "gap" or (b) both the
> upper bound for that partition and the lower bound for the next
> partition. With the inclusive/exclusive bound stuff gone, every range
> bound has the same sense: if the probed value is <= the bound then
> we're supposed to be a lower-numbered partition, but if > then we're
> supposed to be in this partition or a higher-numbered one.

OK, I've managed to get rid of lower. At least it is no longer kept in
the new relcache struct PartitionBoundInfo. It is still kept in
PartitionRangeBound which is used to hold individual range bounds when
sorting them (during relcache build). Comparisons invoked during the
aforementioned sorting step still need to distinguish between lower and
upper bounds (such that '1)' < '[1').

Tuple-routing no longer needs to look at lower. In that case, what you
described above applies.

As a result, one change became necessary: to how we flag individual range
bound datum as infinite or not. Previously, it was a regular Boolean
value (either infinite or not) and to distinguish +infinity from
-infinity, we looked at whether the bound is lower or upper (the lower
flag). Now, instead, the variable holding the status of individual range
bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
a bool. Upon encountering an infinite range bound datum, whether it's
negative or positive infinity derives the comparison result. Consider the
following example:

partition p1 from (1, unbounded) to (1, 1);
partition p2 from (1, 1) to (1, 10);
partition p3 from (1, 10) to (1, unbounded);
partition p4 from (2, unbounded) to (2, 1);
... so on

In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
(1, +inf), so that tuple (1, 15) is assigned to the proper partition.

Does this last thing sound reasonable?


Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-17.patch text/x-diff 115.3 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-17.patch text/x-diff 23.9 KB
0003-Catalog-and-DDL-for-partitions-17.patch text/x-diff 206.4 KB
0004-psql-and-pg_dump-support-for-partitions-17.patch text/x-diff 22.1 KB
0005-Teach-a-few-places-to-use-partition-check-quals-17.patch text/x-diff 30.8 KB
0006-Tuple-routing-for-partitioned-tables-17.patch text/x-diff 36.3 KB
0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-17.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2016-11-22 09:27:14 Re: Measuring replay lag
Previous Message Ideriha, Takeshi 2016-11-22 08:55:42 Re: Forbid use of LF and CR characters in database and role names