Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-18 10:59:08
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016/11/18 4:14, Robert Haas wrote:
> On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Meanwhile, here are updated patch that address most of the following comments.
> OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
> a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
> there,


> and I think the entire comment block that starts with "NOTE:
> SQL specifies that a NULL" and ends with "it's unlikely that NULL
> would result." should be changed to say something like /* As for
> catalogued constraints, we treat a NULL result as a success, not a
> failure. */ rather than duplicating an existing comment that doesn't
> quite apply here.

Ah, you're right that the comment does not apply as it is. I rewrote that

Oh but wait, that means I can insert rows with NULLs in the range
partition key if I choose to insert it directly into the partition,
whereas I have been thinking all this while that there could never be
NULLs in the partition key of a range partition. What's more,
get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
every partition key column in case of a range partition. Is that
wrongheaded altogether? (also see my reply to your earlier message about
NULLs in the range partition key)

> Finally, ExecConstraints() contains a new if-block
> whose sole contents are another if-block. Perhaps if (this && that)
> would be better.

Agreed, should have noticed that.

> Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
> you've chosen is awfully convoluted and probably not that efficient.
> For example, get_partition_for_tuple() contains this loop:
> + prev = parent;
> + node = parent->downlink;
> + while (node != NULL)
> + {
> + if (node->index >= cur_idx)
> + break;
> +
> + prev = node;
> + node = node->next;
> + }
> Well, it looks to me like that's an O(n) way to find the n'th
> partition, which seems like a pretty bad idea in performance-critical
> code, which this is. I think this whole structure needs a pretty
> heavy overhaul. Here's a proposal:

Thanks for the idea below!

> 1. Forget the idea of a tree. Instead, let the total number of tables
> in the partitioning hierarchy be N and let the number of those that
> are partitioned be K. Assign each partitioned table in the hierarchy
> an index between 0 and K-1. Make your top level data structure (in
> lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
> objects, with the partitioning root in entry 0 and the rest in the
> remaining entries.
> 2. Within each PartitionDispatch object, store (a) a pointer to a
> PartitionDesc and (b) an array of integers of length equal to the
> PartitionDesc's nparts value. Each integer i, if non-negative, is the
> final return value for get_partition_for_tuple. If i == -1, tuple
> routing fails. If i < -1, we must next route using the subpartition
> whose PartitionDesc is at index -(i+1). Arrange for the array to be
> in the same order the PartitionDesc's OID list.
> 3. Now get_partition_for_tuple looks something like this:
> K = 0
> loop:
> pd = PartitionDispatch[K]
> idx = list/range_partition_for_tuple(pd->partdesc, ...)
> if (idx >= -1)
> return idx
> K = -(idx + 1)
> No recursion, minimal pointer chasing, no linked lists. The whole
> thing is basically trivial aside from the cost of
> list/range_partition_for_tuple itself; optimizing that is a different
> project. I might have some details slightly off here, but hopefully
> you can see what I'm going for: you want to keep the computation that
> happens in get_partition_for_tuple() to an absolute minimum, and
> instead set things up in advance so that getting the partition for a
> tuple is FAST. And you want the data structures that you are using in
> that process to be very compact, hence arrays instead of linked lists.

This sounds *much* better. Here is a quick attempt at coding the design
you have outlined above in the attached latest set of patches.

PS: I haven't run the patches through pgindent yet.


Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-16.patch text/x-diff 115.3 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-16.patch text/x-diff 23.9 KB
0003-Catalog-and-DDL-for-partitions-16.patch text/x-diff 208.2 KB
0004-psql-and-pg_dump-support-for-partitions-16.patch text/x-diff 22.1 KB
0005-Teach-a-few-places-to-use-partition-check-quals-16.patch text/x-diff 30.8 KB
0006-Introduce-a-PartitionDispatch-data-structure-16.patch text/x-diff 6.6 KB
0007-Tuple-routing-for-partitioned-tables-16.patch text/x-diff 33.9 KB
0008-Update-DDL-Partitioning-chapter-to-reflect-new-devel-16.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-11-18 11:40:18 Re: Push down more full joins in postgres_fdw
Previous Message Emre Hasegeli 2016-11-18 09:58:27 Re: Floating point comparison inconsistencies of the geometric types