Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-17 19:14:24
Message-ID: CA+TgmobF2r=f-crrE-k7WM8iFpBKLz3dtBtEc=KmkudYViYcyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 17, 2016 at 6:27 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Meanwhile, here are updated patch that address most of the following comments.

OK, I have re-reviewed 0005 and it looks basically fine to me, modulo
a few minor nitpicks. "This is called, *iff*" shouldn't have a comma
there, and I think the entire comment block that starts with "NOTE:
SQL specifies that a NULL" and ends with "it's unlikely that NULL
would result." should be changed to say something like /* As for
catalogued constraints, we treat a NULL result as a success, not a
failure. */ rather than duplicating an existing comment that doesn't
quite apply here. Finally, ExecConstraints() contains a new if-block
whose sole contents are another if-block. Perhaps if (this && that)
would be better.

Regarding 0006 and 0007, I think the PartitionTreeNodeData structure
you've chosen is awfully convoluted and probably not that efficient.
For example, get_partition_for_tuple() contains this loop:

+ prev = parent;
+ node = parent->downlink;
+ while (node != NULL)
+ {
+ if (node->index >= cur_idx)
+ break;
+
+ prev = node;
+ node = node->next;
+ }

Well, it looks to me like that's an O(n) way to find the n'th
partition, which seems like a pretty bad idea in performance-critical
code, which this is. I think this whole structure needs a pretty
heavy overhaul. Here's a proposal:

1. Forget the idea of a tree. Instead, let the total number of tables
in the partitioning hierarchy be N and let the number of those that
are partitioned be K. Assign each partitioned table in the hierarchy
an index between 0 and K-1. Make your top level data structure (in
lieu of PartitionTreeNodeData) be an array of K PartitionDispatch
objects, with the partitioning root in entry 0 and the rest in the
remaining entries.

2. Within each PartitionDispatch object, store (a) a pointer to a
PartitionDesc and (b) an array of integers of length equal to the
PartitionDesc's nparts value. Each integer i, if non-negative, is the
final return value for get_partition_for_tuple. If i == -1, tuple
routing fails. If i < -1, we must next route using the subpartition
whose PartitionDesc is at index -(i+1). Arrange for the array to be
in the same order the PartitionDesc's OID list.

3. Now get_partition_for_tuple looks something like this:

K = 0
loop:
pd = PartitionDispatch[K]
idx = list/range_partition_for_tuple(pd->partdesc, ...)
if (idx >= -1)
return idx
K = -(idx + 1)

No recursion, minimal pointer chasing, no linked lists. The whole
thing is basically trivial aside from the cost of
list/range_partition_for_tuple itself; optimizing that is a different
project. I might have some details slightly off here, but hopefully
you can see what I'm going for: you want to keep the computation that
happens in get_partition_for_tuple() to an absolute minimum, and
instead set things up in advance so that getting the partition for a
tuple is FAST. And you want the data structures that you are using in
that process to be very compact, hence arrays instead of linked lists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-11-17 19:16:31 Re: Fun fact about autovacuum and orphan temp tables
Previous Message Pavel Stehule 2016-11-17 18:53:58 Re: proposal: psql \setfileref