Re: Declarative partitioning

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning
Date: 2016-08-10 10:18:19
Message-ID: CAFjFpRcZoxr+F-OjHu77gj2EGUCJBwibNUc655n-_2afi2XBnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

FOR VALUE clause of a partition does not allow a constant expression like
(10000/5 -1). It gives syntax error
regression=# create table pt1_p1 partition of pt1 for values start (0) end
((10000/5) - 1);
ERROR: syntax error at or near "("
LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...

Shouldn't we allow constant expressions here?

If this has been already discussed, please forgive me and point out the
relevant mail chain.

On Tue, Aug 9, 2016 at 12:48 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

> What strikes me odd about these patches is RelOptInfo has remained
> unmodified. For a base partitioned table, I would expect it to be marked as
> partitioned may be indicating the partitioning scheme. Instead of that, I
> see that the code directly deals with PartitionDesc, PartitionKey which are
> part of Relation. It uses other structures like PartitionKeyExecInfo. I
> don't have any specific observation as to why we need such information in
> RelOptInfo, but lack of it makes me uncomfortable. It could be because
> inheritance code does not require any mark in RelOptInfo and your patch
> re-uses inheritance code. I am not sure.
>
> On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <
> Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
>> On 2016/08/09 6:02, Robert Haas wrote:
>> > On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
>> > <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> >>> +1, if we could do it. It will need a change in the way Amit's patch
>> stores
>> >>> partitioning scheme in PartitionDesc.
>> >>
>> >> Okay, I will try to implement this in the next version of the patch.
>> >>
>> >> One thing that comes to mind is what if a user wants to apply hash
>> >> operator class equality to list partitioned key by specifying a hash
>> >> operator class for the corresponding column. In that case, we would
>> not
>> >> have the ordering procedure with an hash operator class, hence any
>> >> ordering based optimization becomes impossible to implement. The
>> current
>> >> patch rejects a column for partition key if its type does not have a
>> btree
>> >> operator class for both range and list methods, so this issue doesn't
>> >> exist, however it could be seen as a limitation.
>> >
>> > Yes, I think you should expect careful scrutiny of that issue. It
>> > seems clear to me that range partitioning requires a btree opclass,
>> > that hash partitioning requires a hash opclass, and that list
>> > partitioning requires at least one of those things. It would probably
>> > be reasonable to pick one or the other and insist that list
>> > partitioning always requires exactly that, but I can't see how it's
>> > reasonable to insist that you must have both types of opclass for any
>> > type of partitioning.
>>
>> So because we intend to implement optimizations for list partition
>> metadata that presuppose existence of corresponding btree operator class,
>> we should just always require user to specify one (or error out if user
>> doesn't specify and a default one doesn't exist). That way, we explicitly
>> do not support specify hash equality operator for list partitioning.
>>
>> >> So, we have 3 choices for the internal representation of list
>> partitions:
>> >>
>> >> Choice 1 (the current approach): Load them in the same order as they
>> are
>> >> found in the partition catalog:
>> >>
>> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>> >>
>> >> In this case, mismatch on the first list would make the two tables
>> >> incompatibly partitioned, whereas they really aren't incompatible.
>> >
>> > Such a limitation seems clearly unacceptable. We absolutely must be
>> > able to match up compatible partitioning schemes without getting
>> > confused by little details like the order of the partitions.
>>
>> Agreed. Will change my patch to adopt the below method.
>>
>> >> Choice 2: Representation with 2 arrays:
>> >>
>> >> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>> >>
>> >> It still doesn't help the case of pairwise joins because it's hard to
>> tell
>> >> which value belongs to which partition (the 2nd array carries the
>> original
>> >> partition numbers). Although it might still work for tuple-routing.
>> >
>> > It's very good for tuple routing. It can also be used to match up
>> > partitions for pairwise joins. Compare the first arrays. If they are
>> > unequal, stop. Else, compare the second arrays, incrementally
>> > building a mapping between them and returning false if the mapping
>> > turns out to be non-bijective. For example, in this case, we look at
>> > index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
>> > We look at index 2 and decide 2 -> 1. We look at index 4 and find
>> > that we already have a mapping for 2, but it's compatible because we
>> > need 2 -> 1 and that's what is already there. Similarly for the
>> > remaining entries. This is really a pretty easy loop to write and it
>> > should run very quickly.
>>
>> I see, it does make sense to try to implement this way.
>>
>> >> Choice 3: Order all lists' elements for each list individually and then
>> >> order the lists themselves on their first values:
>> >>
>> >> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>> >>
>> >> This representation makes pairing partitions for pairwise joining
>> >> convenient but for tuple-routing we still need to visit each partition
>> in
>> >> the worst case.
>> >
>> > I think this is clearly not good enough for tuple routing. If the
>> > algorithm I proposed above turns out to be too slow for matching
>> > partitions, then we could keep both this representation and the
>> > previous one. We are not limited to just one. But I don't think
>> > that's likely to be the case.
>>
>> I agree. Let's see how the option 2 turns out.
>>
>> > Also, note that all of this presupposes we're doing range
>> > partitioning, or perhaps list partitioning with a btree opclass. For
>> > partitioning based on a hash opclass, you'd organize the data based on
>> > the hash values rather than range comparisons.
>>
>> Yes, the current patch does not implement hash partitioning, although I
>> have to think about how to support the hash case when designing the
>> internal data structures.
>>
>>
>> By the way, I am planning to start a new thread with the latest set of
>> patches which I will post in a day or two. I have tried to implement all
>> the bug fixes and improvements that have been suggested on this thread so
>> far. Thanks to all those who reviewed and gave their comments. Please
>> check this page to get a link to the new thread:
>> https://commitfest.postgresql.org/10/611/
>>
>> Thanks,
>> Amit
>>
>>
>>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2016-08-10 10:29:11 Re: Declarative partitioning
Previous Message Ants Aasma 2016-08-10 10:03:54 Re: Proposal for CSN based snapshots