Quick Links

Re: Declarative partitioning

From:	Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Declarative partitioning
Date:	2016-08-09 07:18:11
Message-ID:	CAFjFpRdvAPAoH+wUF+f4jfui3pKGM6i78DHBT3GTfoUO4Aj=+g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

What strikes me odd about these patches is RelOptInfo has remained
unmodified. For a base partitioned table, I would expect it to be marked as
partitioned may be indicating the partitioning scheme. Instead of that, I
see that the code directly deals with PartitionDesc, PartitionKey which are
part of Relation. It uses other structures like PartitionKeyExecInfo. I
don't have any specific observation as to why we need such information in
RelOptInfo, but lack of it makes me uncomfortable. It could be because
inheritance code does not require any mark in RelOptInfo and your patch
re-uses inheritance code. I am not sure.

On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
wrote:

> On 2016/08/09 6:02, Robert Haas wrote:
> > On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
> > <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>> +1, if we could do it. It will need a change in the way Amit's patch
> stores
> >>> partitioning scheme in PartitionDesc.
> >>
> >> Okay, I will try to implement this in the next version of the patch.
> >>
> >> One thing that comes to mind is what if a user wants to apply hash
> >> operator class equality to list partitioned key by specifying a hash
> >> operator class for the corresponding column. In that case, we would not
> >> have the ordering procedure with an hash operator class, hence any
> >> ordering based optimization becomes impossible to implement. The
> current
> >> patch rejects a column for partition key if its type does not have a
> btree
> >> operator class for both range and list methods, so this issue doesn't
> >> exist, however it could be seen as a limitation.
> >
> > Yes, I think you should expect careful scrutiny of that issue. It
> > seems clear to me that range partitioning requires a btree opclass,
> > that hash partitioning requires a hash opclass, and that list
> > partitioning requires at least one of those things. It would probably
> > be reasonable to pick one or the other and insist that list
> > partitioning always requires exactly that, but I can't see how it's
> > reasonable to insist that you must have both types of opclass for any
> > type of partitioning.
>
> So because we intend to implement optimizations for list partition
> metadata that presuppose existence of corresponding btree operator class,
> we should just always require user to specify one (or error out if user
> doesn't specify and a default one doesn't exist). That way, we explicitly
> do not support specify hash equality operator for list partitioning.
>
> >> So, we have 3 choices for the internal representation of list
> partitions:
> >>
> >> Choice 1 (the current approach): Load them in the same order as they
> are
> >> found in the partition catalog:
> >>
> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
> >>
> >> In this case, mismatch on the first list would make the two tables
> >> incompatibly partitioned, whereas they really aren't incompatible.
> >
> > Such a limitation seems clearly unacceptable. We absolutely must be
> > able to match up compatible partitioning schemes without getting
> > confused by little details like the order of the partitions.
>
> Agreed. Will change my patch to adopt the below method.
>
> >> Choice 2: Representation with 2 arrays:
> >>
> >> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
> >>
> >> It still doesn't help the case of pairwise joins because it's hard to
> tell
> >> which value belongs to which partition (the 2nd array carries the
> original
> >> partition numbers). Although it might still work for tuple-routing.
> >
> > It's very good for tuple routing. It can also be used to match up
> > partitions for pairwise joins. Compare the first arrays. If they are
> > unequal, stop. Else, compare the second arrays, incrementally
> > building a mapping between them and returning false if the mapping
> > turns out to be non-bijective. For example, in this case, we look at
> > index 0 and decide that 3 -> 2. We look at index 1 and decide 1 -> 3.
> > We look at index 2 and decide 2 -> 1. We look at index 4 and find
> > that we already have a mapping for 2, but it's compatible because we
> > need 2 -> 1 and that's what is already there. Similarly for the
> > remaining entries. This is really a pretty easy loop to write and it
> > should run very quickly.
>
> I see, it does make sense to try to implement this way.
>
> >> Choice 3: Order all lists' elements for each list individually and then
> >> order the lists themselves on their first values:
> >>
> >> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
> >>
> >> This representation makes pairing partitions for pairwise joining
> >> convenient but for tuple-routing we still need to visit each partition
> in
> >> the worst case.
> >
> > I think this is clearly not good enough for tuple routing. If the
> > algorithm I proposed above turns out to be too slow for matching
> > partitions, then we could keep both this representation and the
> > previous one. We are not limited to just one. But I don't think
> > that's likely to be the case.
>
> I agree. Let's see how the option 2 turns out.
>
> > Also, note that all of this presupposes we're doing range
> > partitioning, or perhaps list partitioning with a btree opclass. For
> > partitioning based on a hash opclass, you'd organize the data based on
> > the hash values rather than range comparisons.
>
> Yes, the current patch does not implement hash partitioning, although I
> have to think about how to support the hash case when designing the
> internal data structures.
>
>
> By the way, I am planning to start a new thread with the latest set of
> patches which I will post in a day or two. I have tried to implement all
> the bug fixes and improvements that have been suggested on this thread so
> far. Thanks to all those who reviewed and gave their comments. Please
> check this page to get a link to the new thread:
> https://commitfest.postgresql.org/10/611/
>
> Thanks,
> Amit
>
>
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Re: Declarative partitioning at 2016-08-09 03:47:38 from Amit Langote

Responses

Re: Declarative partitioning at 2016-08-10 10:18:19 from Ashutosh Bapat

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Masahiko Sawada	2016-08-09 07:59:31	Re: Logical Replication WIP
Previous Message	Sridhar N Bamandlapally	2016-08-09 05:24:42	Re: Surprising behaviour of \set AUTOCOMMIT ON