Re: Declarative partitioning - another take

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-24 05:35:41
Message-ID: CAFjFpRdAv_wSqz6px25sSkgr1wOqXAEvd8vP2R2U1Afv8wXh3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I am trying to create a partitioned table with primary keys on the
partitions. Here's the corresponding syntax as per documentation
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
IF NOT EXISTS ] table_name
PARTITION OF parent_table [ (
{ column_name WITH OPTIONS [ column_constraint [ ... ] ]
| table_constraint }
[, ... ]
) ] partition_bound_spec

IIUC, it should allow "create table t1_p1 partition of t1 (a primary
key) ...", (a primary key) is nothing but "column_name
column_constraint", but here's what happens
create table t1_p1 partition of t1 (a primary key) for values from (0) to (100);
ERROR: syntax error at or near "primary"
LINE 1: create table t1_p1 partition of t1 (a primary key) for value...

The same syntax also suggests using table_constraints but that too doesn't work
create table t1_p1 partition of t1 (primary key (a) ) for values
from (0) to (100);
ERROR: inherited relation "t1" is not a table or foreign table

of course t1 is a table, what it isn't?

Am I missing something? How do I define constraints on the partitions?

On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> Updated patches attached. I merged what used to be 0006 and 0007 into one.
>
> On 2016/11/19 2:23, Robert Haas wrote:
>> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> Oh but wait, that means I can insert rows with NULLs in the range
>>> partition key if I choose to insert it directly into the partition,
>>> whereas I have been thinking all this while that there could never be
>>> NULLs in the partition key of a range partition. What's more,
>>> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
>>> every partition key column in case of a range partition. Is that
>>> wrongheaded altogether? (also see my reply to your earlier message about
>>> NULLs in the range partition key)
>>
>> The easiest thing to do might be to just enforce that all of the
>> partition key columns have to be not-null when the range-partitioned
>> table is defined, and reject any attempt to DROP NOT NULL on them
>> later. That's probably better that shoehorning it into the table
>> constraint.
>
> Agreed that forcing range partitioning columns to be NOT NULL during table
> creation would be a better approach. But then we would have to reject
> using expressions in the range partition key, right?
>
>>> Thanks for the idea below!
>>>
>>>> 1. Forget the idea of a tree. Instead, let the total number of tables
>>>> in the partitioning hierarchy be N and let the number of those that
>
> [ ... ]
>
>>>>
>>>> No recursion, minimal pointer chasing, no linked lists. The whole
>>>> thing is basically trivial aside from the cost of
>>>> list/range_partition_for_tuple itself; optimizing that is a different
>>>> project. I might have some details slightly off here, but hopefully
>>>> you can see what I'm going for: you want to keep the computation that
>>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>>> instead set things up in advance so that getting the partition for a
>>>> tuple is FAST. And you want the data structures that you are using in
>>>> that process to be very compact, hence arrays instead of linked lists.
>>>
>>> This sounds *much* better. Here is a quick attempt at coding the design
>>> you have outlined above in the attached latest set of patches.
>>
>> That shrank both 0006 and 0007 substantially, and it should be faster,
>> too. I bet you can shrink them further:
>
> Some changes described below have reduced the size to a certain degree.
>
>>
>> - Why is PartitionKeyExecInfo a separate structure and why does it
>> have a NodeTag? I bet you can dump the node tag, merge it into
>> PartitionDispatch, and save some more code and some more
>> pointer-chasing.
>
> OK, I merged the fields of what used to be PartitionKeyExecInfo into
> PartitionDispatchData as the latter's new fields key and keystate.
>
>> - I still think it's a seriously bad idea for list partitioning and
>> range partitioning to need different code-paths all over the place
>> here. List partitions support nulls but not multi-column partitioning
>> keys and range partitions support multi-column partitioning keys but
>> not nulls, but you could use an internal structure that supports both.
>> Then you wouldn't need partition_list_values_bsearch and also
>> partition_rbound_bsearch; you could have one kind of bound structure
>> that can be bsearch'd for either list or range. You might even be
>> able to unify list_partition_for_tuple and range_partition_for_tuple
>> although that looks a little harder. In either case, you bsearch for
>> the greatest value <= the value you have. The only difference is that
>> for list partitioning, you have to enforce at the end that it is an
>> equal value, whereas for range partitioning less-than-or-equal-to is
>> enough. But you should still be able to arrange for more code
>> sharing.
>
> I have considered these suggestions in the latest patch. Now instead of
> PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
> there is only one PartitionBoundInfo which consolidates the partition
> bound information of a partitioned table. Some of the fields are
> applicable only to one of list or range case; for example, null-accepting
> list partition index, infinite status of individual range datums.
>
> Also, there is now only one binary search function named
> partition_bound_bsearch() which invokes a comparison function named
> partition_bound_cmp(). The former searches a probe (a partition bound or
> tuple) within a PartitionBoundInfo, which is passed all the way down to
> the comparison function.
>
> Also, we no longer have list_partition_for_tuple() and
> range_partition_for_tuple(). Instead, in get_partition_for_tuple()
> itself, there is a bsearch followed by list and range partitioning
> specific steps based on the returned offset.
>
>> - I don't see why you need the bound->lower stuff any more. If
>> rangeinfo.bounds[offset] is a lower bound for a partition, then
>> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
>> partition and the partition is followed by a "gap" or (b) both the
>> upper bound for that partition and the lower bound for the next
>> partition. With the inclusive/exclusive bound stuff gone, every range
>> bound has the same sense: if the probed value is <= the bound then
>> we're supposed to be a lower-numbered partition, but if > then we're
>> supposed to be in this partition or a higher-numbered one.
>
> OK, I've managed to get rid of lower. At least it is no longer kept in
> the new relcache struct PartitionBoundInfo. It is still kept in
> PartitionRangeBound which is used to hold individual range bounds when
> sorting them (during relcache build). Comparisons invoked during the
> aforementioned sorting step still need to distinguish between lower and
> upper bounds (such that '1)' < '[1').
>
> Tuple-routing no longer needs to look at lower. In that case, what you
> described above applies.
>
> As a result, one change became necessary: to how we flag individual range
> bound datum as infinite or not. Previously, it was a regular Boolean
> value (either infinite or not) and to distinguish +infinity from
> -infinity, we looked at whether the bound is lower or upper (the lower
> flag). Now, instead, the variable holding the status of individual range
> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
> a bool. Upon encountering an infinite range bound datum, whether it's
> negative or positive infinity derives the comparison result. Consider the
> following example:
>
> partition p1 from (1, unbounded) to (1, 1);
> partition p2 from (1, 1) to (1, 10);
> partition p3 from (1, 10) to (1, unbounded);
> partition p4 from (2, unbounded) to (2, 1);
> ... so on
>
> In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
> (1, +inf), so that tuple (1, 15) is assigned to the proper partition.
>
> Does this last thing sound reasonable?
>
> Thanks,
> Amit

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-11-24 05:39:22 Re: Declarative partitioning - another take
Previous Message Pavel Stehule 2016-11-24 05:18:31 Re: patch: function xmltable