Re: Declarative partitioning - another take

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-09-22 07:32:14
Message-ID: CAFjFpRdYfNVMPYKyXLpUHuGwqC2CjmgtBe9RUeUU9yHX8CO3kw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For list partitions, the ListInfo stores the index maps for values
i.e. the index of the partition to which the value belongs. Those
indexes are same as the indexes in partition OIDs array and come from
the catalogs. In case a user creates two partitioned tables with
exactly same lists for partitions but specifies them in a different
order, the OIDs are stored in the order specified. This means that
index array for these tables come out different. equal_list_info()
works around that by creating an array of mappings and checks whether
that mapping is consistent for all values. This means we will create
the mapping as many times as equal_list_info() is called, which is
expected to be more than the number of time
RelationBuildPartitionDescriptor() is called. Instead, if we
"canonicalise" the indexes so that they come out exactly same for
similarly partitioned tables, we build the mapping only once and
arrange OIDs accordingly.

Here's patch to do that. I have ran make check with this and it didn't
show any failure. Please consider this to be included in your next set
of patches.

That helps partition-wise join as well. For partition-wise join (and
further optimizations for partitioned tables), we create a list of
canonical partition schemes. In this list two similarly partitioned
tables share partition scheme pointer. A join between relations with
same partition scheme pointer can be joined partition-wise. It's
important that the indexes in partition scheme match to the OIDs array
to find matching RelOptInfos for partition-wise join.

On Thu, Sep 22, 2016 at 11:12 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Hi Amit,
> Following sequence of DDLs gets an error
> --
> -- multi-leveled partitions
> --
> CREATE TABLE prt1_l (a int, b int, c varchar) PARTITION BY RANGE(a);
> CREATE TABLE prt1_l_p1 PARTITION OF prt1_l FOR VALUES START (0) END
> (250) PARTITION BY RANGE (b);
> CREATE TABLE prt1_l_p1_p1 PARTITION OF prt1_l_p1 FOR VALUES START (0) END (100);
> CREATE TABLE prt1_l_p1_p2 PARTITION OF prt1_l_p1 FOR VALUES START
> (100) END (250);
> CREATE TABLE prt1_l_p2 PARTITION OF prt1_l FOR VALUES START (250) END
> (500) PARTITION BY RANGE (c);
> CREATE TABLE prt1_l_p2_p1 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0250') END ('0400');
> CREATE TABLE prt1_l_p2_p2 PARTITION OF prt1_l_p2 FOR VALUES START
> ('0400') END ('0500');
> CREATE TABLE prt1_l_p3 PARTITION OF prt1_l FOR VALUES START (500) END
> (600) PARTITION BY RANGE ((b + a));
> ERROR: cannot use column or expression from ancestor partition key
>
> The last statement is trying create subpartitions by range (b + a),
> which contains a partition key from ancestor partition key but is not
> exactly same as that. In fact it contains some extra columns other
> than the ancestor partition key columns. Why do we want to prohibit
> such cases?
>
> On Thu, Sep 15, 2016 at 2:23 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>> Hi
>>
>> On 2016/09/09 17:55, Amit Langote wrote:
>>> On 2016/09/06 22:04, Amit Langote wrote:
>>>> Will fix.
>>>
>>> Here is an updated set of patches.
>>
>> An email from Rajkumar somehow managed to break out of this thread.
>> Quoting his message below so that I don't end up replying with patches on
>> two different threads.
>>
>> On 2016/09/14 16:58, Rajkumar Raghuwanshi wrote:
>>> I have Continued with testing declarative partitioning with the latest
>>> patch. Got some more observation, given below
>>
>> Thanks a lot for testing.
>>
>>> -- Observation 1 : Getting overlap error with START with EXCLUSIVE in range
>>> partition.
>>>
>>> create table test_range_bound ( a int) partition by range(a);
>>> --creating a partition to contain records {1,2,3,4}, by default 1 is
>>> inclusive and 5 is exclusive
>>> create table test_range_bound_p1 partition of test_range_bound for values
>>> start (1) end (5);
>>> --now trying to create a partition by explicitly mentioning start is
>>> exclusive to contain records {5,6,7}, here trying to create with START with
>>> 4 as exclusive so range should be 5 to 8, but getting partition overlap
>>> error.
>>> create table test_range_bound_p2 partition of test_range_bound for values
>>> start (4) EXCLUSIVE end (8);
>>> ERROR: partition "test_range_bound_p2" would overlap partition
>>> "test_range_bound_p1"
>>
>> Wow, this is bad. What is needed in this case is "canonicalization" of
>> the range partition bounds specified in the command. Range types do this
>> and hence an equivalent test done with range type values would disagree
>> with the result given by the patch for range partition bounds.
>>
>> select '[1,5)'::int4range && '(4,8]'::int4range as cmp;
>> cmp
>> -----
>> f
>> (1 row)
>>
>> In this case, the second range is converted into its equivalent canonical
>> form viz. '[5, 9)'. Then comparison of bounds 5) and [5 can tell that the
>> ranges do not overlap after all. Range type operators can do this because
>> their code can rely on the availability of a canonicalization function for
>> a given range type. From the range types documentation:
>>
>> """
>> A discrete range type should have a canonicalization function that is
>> aware of the desired step size for the element type. The canonicalization
>> function is charged with converting equivalent values of the range type to
>> have identical representations, in particular consistently inclusive or
>> exclusive bounds. If a canonicalization function is not specified, then
>> ranges with different formatting will always be treated as unequal, even
>> though they might represent the same set of values in reality.
>> """
>>
>> to extend the last sentence:
>>
>> "... or consider two ranges overlapping when in reality they are not
>> (maybe they are really just adjacent)."
>>
>> Within the code handling range partition bound, no such canonicalization
>> happens, so comparison 5) and (4 ends up concluding that upper1 > lower2,
>> hence ranges overlap.
>>
>> To mitigate this, how about we restrict range partition key to contain
>> columns of only those types for which we know we can safely canonicalize a
>> range bound (ie, discrete range types)? I don't think we can use, say,
>> existing int4range_canonical but will have to write a version of it for
>> partitioning usage (range bounds of partitions are different from what
>> int4range_canonical is ready to handle). This approach will be very
>> limiting as then range partitions will be limited to columns of int,
>> bigint and date type only.
>>
>> One more option is we let the user specify the canonicalize function next
>> to the column name when defining the partition key. If not specified, we
>> hard-code one for the types for which we will be implementing a
>> canonicalize function (ie, above mentioned types). In other cases, we
>> just don't have one and hence if an unexpected result occurs when creating
>> a new partition, it's up to the user to realize what happened. Of course,
>> we will be mentioning in the documentation why a canonicalize function is
>> necessary and how to write one. Note that this canonicalize function
>> comes into play only when defining new partitions, it has no role beyond
>> that point.
>>
>>> -- Observation 2 : able to create sub-partition out of the range set for
>>> main table, causing not able to insert data satisfying any of the partition.
>>>
>>> create table test_subpart (c1 int) partition by range (c1);
>>> create table test_subpart_p1 partition of test_subpart for values start (1)
>>> end (100) inclusive partition by range (c1);
>>> create table test_subpart_p1_sub1 partition of test_subpart_p1 for values
>>> start (101) end (200);
>>>
>>> \d+ test_subpart
>>> Table "public.test_subpart"
>>> Column | Type | Modifiers | Storage | Stats target | Description
>>> --------+---------+-----------+---------+--------------+-------------
>>> c1 | integer | | plain | |
>>> Partition Key: RANGE (c1)
>>> Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE
>>>
>>> \d+ test_subpart_p1
>>> Table "public.test_subpart_p1"
>>> Column | Type | Modifiers | Storage | Stats target | Description
>>> --------+---------+-----------+---------+--------------+-------------
>>> c1 | integer | | plain | |
>>> Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
>>> Partition Key: RANGE (c1)
>>> Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)
>>>
>>> insert into test_subpart values (50);
>>> ERROR: no partition of relation "test_subpart_p1" found for row
>>> DETAIL: Failing row contains (50).
>>> insert into test_subpart values (150);
>>> ERROR: no partition of relation "test_subpart" found for row
>>> DETAIL: Failing row contains (150).
>>
>> It seems that DDL should prevent the same column being used in partition
>> key of lower level partitions. I don't know how much sense it would make,
>> but being able to use the same column as partition key of lower level
>> partitions may be a feature useful to some users if they know what they
>> are doing. But this last part doesn't sound like a good thing. I
>> modified the patch such that lower level partitions cannot use columns
>> used by ancestor tables.
>>
>>> -- Observation 3 : Getting cache lookup failed, when selecting list
>>> partition table containing array.
>>>
>>> CREATE TABLE test_array ( i int,j int[],k text[]) PARTITION BY LIST (j);
>>> CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
>>> CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
>>>
>>> INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
>>> INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
>>> tableoid | i | j | k
>>> ---------------+---+-----+-----
>>> test_array_p1 | 1 | {1} | {1}
>>> (1 row)
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
>>> tableoid | i | j | k
>>> ---------------+---+-------+-----
>>> test_array_p2 | 2 | {2,2} | {2}
>>> (1 row)
>>>
>>> postgres=# SELECT tableoid::regclass,* FROM test_array;
>>> ERROR: cache lookup failed for type 0
>>
>> That's a bug. Fixed in the attached patch.
>>
>> PS: I'm going to have limited Internet access during this weekend and over
>> the next week, so responses could be slow. Sorry about that.
>>
>> Thanks,
>> Amit
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
canonical_partition_indexes.patch invalid/octet-stream 4.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2016-09-22 08:00:18 Re: Tuplesort merge pre-reading
Previous Message Pavel Stehule 2016-09-22 07:28:21 Re: PL/Python adding support for multi-dimensional arrays