Re: Declarative partitioning

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning
Date: 2016-08-08 21:02:27
Message-ID: CA+TgmoZCr0-t93KgJA3T1uy9yWxfYaSYL3X35ObyHg+ZUfERqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> +1, if we could do it. It will need a change in the way Amit's patch stores
>> partitioning scheme in PartitionDesc.
>
> Okay, I will try to implement this in the next version of the patch.
>
> One thing that comes to mind is what if a user wants to apply hash
> operator class equality to list partitioned key by specifying a hash
> operator class for the corresponding column. In that case, we would not
> have the ordering procedure with an hash operator class, hence any
> ordering based optimization becomes impossible to implement. The current
> patch rejects a column for partition key if its type does not have a btree
> operator class for both range and list methods, so this issue doesn't
> exist, however it could be seen as a limitation.

Yes, I think you should expect careful scrutiny of that issue. It
seems clear to me that range partitioning requires a btree opclass,
that hash partitioning requires a hash opclass, and that list
partitioning requires at least one of those things. It would probably
be reasonable to pick one or the other and insist that list
partitioning always requires exactly that, but I can't see how it's
reasonable to insist that you must have both types of opclass for any
type of partitioning.

>> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b',
>> 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are
>> essentially same, are represented in different ways. It makes matching
>> partitions for partition-wise join a bit tedius. We have to make sure that
>> the first array matches for both the joining relations and then make sure
>> that all the values belonging to the same partition for one table also
>> belong to the same partition in the other table. Some more complex logic
>> for matching subsets of lists for partition-wise join.
>
> So, we have 3 choices for the internal representation of list partitions:
>
> Choice 1 (the current approach): Load them in the same order as they are
> found in the partition catalog:
>
> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>
> In this case, mismatch on the first list would make the two tables
> incompatibly partitioned, whereas they really aren't incompatible.

Such a limitation seems clearly unacceptable. We absolutely must be
able to match up compatible partitioning schemes without getting
confused by little details like the order of the partitions.

> Choice 2: Representation with 2 arrays:
>
> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>
> It still doesn't help the case of pairwise joins because it's hard to tell
> which value belongs to which partition (the 2nd array carries the original
> partition numbers). Although it might still work for tuple-routing.

It's very good for tuple routing. It can also be used to match up
partitions for pairwise joins. Compare the first arrays. If they are
unequal, stop. Else, compare the second arrays, incrementally
building a mapping between them and returning false if the mapping
turns out to be non-bijective. For example, in this case, we look at
index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
We look at index 2 and decide 2 -> 1. We look at index 4 and find
that we already have a mapping for 2, but it's compatible because we
need 2 -> 1 and that's what is already there. Similarly for the
remaining entries. This is really a pretty easy loop to write and it
should run very quickly.

> Choice 3: Order all lists' elements for each list individually and then
> order the lists themselves on their first values:
>
> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>
> This representation makes pairing partitions for pairwise joining
> convenient but for tuple-routing we still need to visit each partition in
> the worst case.

I think this is clearly not good enough for tuple routing. If the
algorithm I proposed above turns out to be too slow for matching
partitions, then we could keep both this representation and the
previous one. We are not limited to just one. But I don't think
that's likely to be the case.

Also, note that all of this presupposes we're doing range
partitioning, or perhaps list partitioning with a btree opclass. For
partitioning based on a hash opclass, you'd organize the data based on
the hash values rather than range comparisons.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ilya Kosmodemiansky 2016-08-08 21:47:11 Re: Wait events monitoring future development
Previous Message Claudio Freire 2016-08-08 20:36:13 Re: Heap WARM Tuples - Design Draft