Re: Declarative partitioning

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Declarative partitioning
Date: 2015-10-30 10:08:46
Message-ID: 563341AE.5010207@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015/08/26 23:01, Simon Riggs wrote:
> The following review has been posted through the commitfest application:
>
...

>
> Needs planner work and tests of that. ALTER TABLE etc can wait.
>
> The new status of this patch is: Waiting on Author

Alright, after spending much time on this I came up with the attached. I'm
sincerely sorry this took so long.

I'll describe below what's in those patches.

The DDL and catalogs part are not much different from what I had last
described though I took a few steps to simplify things. I dropped the
multi-level partitioning bit and changed range partition creation syntax a
little. Ability to specify both min and max for every range partition had
made life much difficult to construct and use the internal representation.
Anyway, attached document patch (and regression tests to some degree)
might give a picture of what user commands look like as of now, although
things are very much subject to overhaul in this general area of user
visible commands and features.

As pointed out by Simon, there is a need to invent a better internal
representation of partitioning constraints/metadata to address the
planning considerations more effectively and scalably. To that end, I have
tried to implement a number of things - There is a proposed
set_partitioned_rel_size() that calls planner's partitioning module
(proposed) which takes RelOptInfo of a partitioned table and returns a
list of OIDs of partitions that are selected to be scanned based on the
baserestrictinfo (other partitions are pruned!) The partitioned rel size
is then computed using size estimates for selected partitions. Later,
proposed set_partitioned_rel_pathlist() will create a Append/MergeAppend
path with cheapest paths for individual partitions as its subpaths. It
seems important to point out that the selected partitions each get
simple_rte_array and a simple_rel_array slots (note that those arrays are
required to be repalloc'd to make those slots available). The partition
reloptinfos are of kind RELOPT_PARTITION_REL which are functionally quite
similar to RELOPT_OTHER_MEMBER_REL but different in that they do not
require using any auxiliary structures like AppendRelInfo and
append_rel_list. The parent-partition mapping would still be necessary and
is achieved using a field called parent_relid of partition reloptinfo.

The proposed partitioning module (in the planner) figures out which of the
clauses to use for partition pruning. The rules of matching clauses to
partition key look similar to those used to match them to an index, at
least with respect to the key representation. Key info includes attribute
number, expression tree (for expression key columns), operator family OID,
per key column. While clauses matched to an index would be returned as it
is to make index paths with, things work differently for clauses matched
to partition key column(s). For each matched clause, the Const operand and
the operator would be added to operand list and operator list,
respectively, for the matched column. Operands are later evaluated to
datums. The datum arrays along with operator OIDs are passed along to
another (general-purpose) partitioning module (proposed). That module
returns a list of OIDs of selected partitions after comparing the clause
operand datums with the partition bounds.

The operator strategies (of passed operators) coupled with partitioning
method (list or range, currently) determine how partitions are selected.
For some partitioning methods, faster partition selection is enabled, for
example, being able to use binary search helps make selecting range
partitions reasonably fast. Note that the clauses matched for a given
column are effectively ANDed with one another. At the moment, using OR
clauses disables partition pruning. But it would be a matter of improving
the structure used to pass around the operands and operators to also
convey the logical structure of clauses. Also, even though range
partitioning allows multi-column key, only the clauses referencing the
first key are used to select partitions. Multi-column range partition
pruning implementation is saved for later. Note that partition bounds are
themselves stored as arrays of datums computed on-demand (cached in
RelationData as part of a structure called PartitionDesc which stores
other info about partitions like OIDs.)

There are not many executor changes currently except for those made to
ensure sane I/U/D behaviors (made to respective ExecI/U/D functions). We
can certainly think about implementing specialized scan node for
partitioned tables but I put it off for now.

Attached user documentation and regression test patches, though not quite
comprehensive, should help one get started.

Thanks,
Amit

Attachment Content-Type Size
declarative-partitioning-wip-1.patch.tar.gz application/gzip 71.2 KB
decl-partitioning-docs.patch text/x-diff 19.8 KB
decl-partitioning-tests.patch text/x-diff 55.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2015-10-30 10:19:08 Re: Getting sorted data from foreign server
Previous Message Robert Haas 2015-10-30 09:49:05 Re: [HACKERS] ExclusiveLock on PostgreSQL - Fabio Mendonça