Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-18 17:23:01
Message-ID: CA+Tgmobksy=erJSrY55AvZJtaEFJmASmmM3Pmn6nOKP=95ZnLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Oh but wait, that means I can insert rows with NULLs in the range
> partition key if I choose to insert it directly into the partition,
> whereas I have been thinking all this while that there could never be
> NULLs in the partition key of a range partition. What's more,
> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
> every partition key column in case of a range partition. Is that
> wrongheaded altogether? (also see my reply to your earlier message about
> NULLs in the range partition key)

The easiest thing to do might be to just enforce that all of the
partition key columns have to be not-null when the range-partitioned
table is defined, and reject any attempt to DROP NOT NULL on them
later. That's probably better that shoehorning it into the table
constraint.

> Thanks for the idea below!
>
>> 1. Forget the idea of a tree. Instead, let the total number of tables
>> in the partitioning hierarchy be N and let the number of those that
>> are partitioned be K. Assign each partitioned table in the hierarchy
>> an index between 0 and K-1. Make your top level data structure (in
>> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
>> objects, with the partitioning root in entry 0 and the rest in the
>> remaining entries.
>>
>> 2. Within each PartitionDispatch object, store (a) a pointer to a
>> PartitionDesc and (b) an array of integers of length equal to the
>> PartitionDesc's nparts value. Each integer i, if non-negative, is the
>> final return value for get_partition_for_tuple. If i == -1, tuple
>> routing fails. If i < -1, we must next route using the subpartition
>> whose PartitionDesc is at index -(i+1). Arrange for the array to be
>> in the same order the PartitionDesc's OID list.
>>
>> 3. Now get_partition_for_tuple looks something like this:
>>
>> K = 0
>> loop:
>> pd = PartitionDispatch[K]
>> idx = list/range_partition_for_tuple(pd->partdesc, ...)
>> if (idx >= -1)
>> return idx
>> K = -(idx + 1)
>>
>> No recursion, minimal pointer chasing, no linked lists. The whole
>> thing is basically trivial aside from the cost of
>> list/range_partition_for_tuple itself; optimizing that is a different
>> project. I might have some details slightly off here, but hopefully
>> you can see what I'm going for: you want to keep the computation that
>> happens in get_partition_for_tuple() to an absolute minimum, and
>> instead set things up in advance so that getting the partition for a
>> tuple is FAST. And you want the data structures that you are using in
>> that process to be very compact, hence arrays instead of linked lists.
>
> This sounds *much* better. Here is a quick attempt at coding the design
> you have outlined above in the attached latest set of patches.

That shrank both 0006 and 0007 substantially, and it should be faster,
too. I bet you can shrink them further:

- Why is PartitionKeyExecInfo a separate structure and why does it
have a NodeTag? I bet you can dump the node tag, merge it into
PartitionDispatch, and save some more code and some more
pointer-chasing.

- I still think it's a seriously bad idea for list partitioning and
range partitioning to need different code-paths all over the place
here. List partitions support nulls but not multi-column partitioning
keys and range partitions support multi-column partitioning keys but
not nulls, but you could use an internal structure that supports both.
Then you wouldn't need partition_list_values_bsearch and also
partition_rbound_bsearch; you could have one kind of bound structure
that can be bsearch'd for either list or range. You might even be
able to unify list_partition_for_tuple and range_partition_for_tuple
although that looks a little harder. In either case, you bsearch for
the greatest value <= the value you have. The only difference is that
for list partitioning, you have to enforce at the end that it is an
equal value, whereas for range partitioning less-than-or-equal-to is
enough. But you should still be able to arrange for more code
sharing.

- I don't see why you need the bound->lower stuff any more. If
rangeinfo.bounds[offset] is a lower bound for a partition, then
rangeinfo.bounds[offset+1] is either (a) the upper bound for that
partition and the partition is followed by a "gap" or (b) both the
upper bound for that partition and the lower bound for the next
partition. With the inclusive/exclusive bound stuff gone, every range
bound has the same sense: if the probed value is <= the bound then
we're supposed to be a lower-numbered partition, but if > then we're
supposed to be in this partition or a higher-numbered one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-11-18 17:24:30 Re: Snapshot too old logging
Previous Message Peter Eisentraut 2016-11-18 17:12:06 Re: [RFC] Should we fix postmaster to avoid slow shutdown?