Re: Partitioning: issues/ideas (Was: Re: On partitioning)

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Partitioning: issues/ideas (Was: Re: On partitioning)
Date: 2015-01-20 01:48:40
Message-ID: 54BDB3F8.30903@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 17-01-2015 AM 02:34, Robert Haas wrote:
> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> * It has been repeatedly pointed out that we may want to decouple
>> partitioning from inheritance because implementing partitioning as an
>> extension of inheritance mechanism means that we have to keep all the
>> existing semantics which might limit what we want to do with the special
>> case of it which is partitioning; in other words, we would find
>> ourselves in difficult position where we have to inject a special case
>> code into a very generalized mechanism that is inheritance.
>> Specifically, do we regard a partitions as pg_inherits children of its
>> partitioning parent?
>
> I don't think this is totally an all-or-nothing decision. I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append. What that implies for pg_inherits isn't altogether
> clear.
>

One point is that an implementation may end up establishing the
parent-partition hierarchy somewhere other than (or in addition to)
pg_inherits. One intention would be to avoid tying partitioning scheme
to certain inheritance features that use pg_inherits. For example,
consider call sites of find_all_inheritors(). One notable example is
Append/MergeAppend which would be of interest to partitioning. We would
want to reuse that part of the infrastructure but we could might as well
write an equivalent, say find_all_partitions() which scans something
other than pg_inherits to get all partitions.

Now, we may not want to do that and instead add special case code to
prevent partitioning from fiddling with unnecessary inheritance features
in the code paths of inheritance. This seems like an important decision
to make.

>> * Syntax: do we want to make it similar to one of the many other
>> databases out there? Or we could invent our own?
>
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite. I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
>
>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>> PARTITION ON that represents a hash distributed/partitioned table (that
>> could be a syntax to support sharded tables maybe; we would definitely
>> want to move ahead in that direction I guess)
>
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.
>

Agree that we may not want to mix the two too much at this point.

>> * Catalog: We would like to have a catalog structure suitable to
>> implement capabilities like multi-column partitioning, sub-partitioning
>> (with arbitrary number of levels in the hierarchy). I had suggested
>> that we create two new catalogs viz. pg_partitioned_rel,
>> pg_partition_def to store metadata about a partition key of a
>> partitioned relation and partition bound info of a partition,
>> respectively. Also, see the point about on-disk representation of
>> partition bounds
>
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today. If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.
>

Agree that some concrete idea of internal representation should help
guide the catalog structure. If we are going to cache the partitioning
info in relcache (which we most definitely will), then we should try to
make sure to consider the scenario of having a lot of partitioned tables
with a lot of individual partitions. It looks like it would be similar
to a scenarios where there are a lot of inheritance hierarchies. But,
availability of partitioning feature would definitely cause these
numbers to grow larger. Perhaps this is an important point driving this
discussion.

I guess this remains tied to the decision we would like make regarding
inheritance (pg_inherits, etc.)

>> * It is desirable to treat partitions as pg_class relations with perhaps
>> a new relkind(s). We may want to choose an implementation where only the
>> lowest level relations in a partitioning hierarchy have storage; those
>> at the upper layers are mere placeholder relations though of course with
>> associated constraints determined by partitioning criteria (with
>> appropriate metadata entered into the additional catalogs).
>
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes. But that's been
> proposed, so evidently someone thinks there's a reason to do it.
>

Again, this remains partly tied to decisions we make regarding catalog
structure.

I am not sure but wouldn't we ever need to tell from a pg_class entry
that a leaf relation has partition bounds associated with it? One reason
I can see that we may not need it is that we would rather use
relispartitioned of a non-leaf relation to trigger finding all its
partitions and their associated bounds; we don't need to know (or
reserve a field for) that a relation has partition bounds associated
with it. The bounds can be stored in pg_partition indexed by relid.
Maybe relkind is not the right field for this anyway.

With that said, would we be comfortable with putting partition key into
pg_class (maybe as a pg_node_tree also encapsulating opclass) so that if
relispartitioned, also look for relpartkey?

>> I am not
>> quite sure if each kind of the relations involved in the partitioning
>> scheme have separate namespaces and, if they are, how we implement that
>
> I am in favor of having all of the nodes in the hierarchy have names
> just as relations do today -- pg_class.relname. Anything else seems
> to me to be complex to implement and of very marginal benefit. But
> again, it's been proposed.
>

The same follows from the my other comments.

>> * In the initial implementation, we could just live with partitioning on
>> a set of columns (and not arbitrary expressions of them)
>
> Seems quite fair.
>
>> * We perhaps do not need multi-column LIST partitions as they are not
>> very widely used and may complicate the implementation
>
> I agree that the use case is marginal; but I'm not sure it needs to
> complicate the implementation much. Depending on how the
> implementation shakes out, prohibiting it might come to seem like more
> of a wart than allowing it.
>

Hmm, I guess implementation may turn out to be generalized enough that
prohibiting it would become a special case and more work.

>> * There are a number of suggestions about how we represent partition
>> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
>> associated with the relation itself), etc. Important point to consider
>> here may be that partition key may contain more than one column
>
> Yep.
>
>> * How we represent partition definition in memory (for a given
>> partitioned relation) - important point to remember is that such a
>> representation should be efficient to iterate through or
>> binary-searchable. Also see the points about tuple-routing and
>> partition-pruning
>
> Yep.
>
>> * Overflow/catchall partition: it seems we do not want/need them. It
>> might seem desirable for example in cases where a big transaction enters
>> a large number of tuples all but one of which find a defined partition;
>> we may not want to error out in such case but instead enter that erring
>> tuple into the overflow partition instead. If we choose to implement
>> that, we would like to also implement the capability to move the tuples
>> into the appropriate partition once it's defined. Related is the notion
>> of automatically creating partitions if one is not already defined for a
>> just entered tuple; but there may be locking troubles if many concurrent
>> sessions try to do that
>
> I think that dynamically creating new partitions is way beyond the
> scope of what this patch should be trying to do. If we ever do it at
> all, it should not be now. The value of a default partition (aka
> overflow partition) seems to me to be debatable. For range
> partitioning, it doesn't seem entirely necessary provided that you can
> define a range with only one endpoint (e.g. partition A has values 1
> to 10, B has 11 and up, and C has 0 and down). For list partitioning,
> though, you might well want something like that. But is it a
> must-have? Dunno.
>
>> * Tuple-routing: based on the internal representation of partition
>> bounds for the partitions of a given partitioned table, there should be
>> a way to map a just entered tuple to partition id it belongs to. Below
>> mentioned BRIN-like machinery could be made to work
>>
>> * Partition-pruning: again, based on the internal representation of
>> partition bounds for the partitions of a given partitioned table, there
>> should be a way to prune partitions deemed unnecessary per scan quals.
>> One notable suggestion is to consider BRIN (-like) machinery. For
>> example, it is able to tell from the scan quals whether a particular
>> block range of a given heap needs to be scanned or not based on summary
>> info index tuple for the block range. Though, the interface is currently
>> suitable to cover a single heap with blocks in range 0 to N-1 of that
>> heap. What we are looking for here is a hypothetical PartitionMemTuple
>> (or PartitionBound) that is a summary of a whole relation (in this case,
>> the partition) NOT a block range. But I guess the infrastructure is
>> generalized enough that we could make that work. Related then would be
>> an equivalent of ScanKey for the partitioning case. Just as ScanKeyData
>> has correspondence with the index being used, the hypothetical
>> PartitionScanKeyData (which may be an entirely bad/half-baked idea!)
>> would represent the application of comparison operator between table
>> column (partitioning key column) and a constant (as per quals).
>
> I'm not going to say this couldn't be done, but how is any of it
> better than having a list of the partition bounds and binary-searching
> it?
>

Of course, my description of it is pretty hand-wavy.

A primary question for me about partition-pruning is when do we do it?
Should we model it after relation_excluded_by_constraints() and hence
totally plan-time? But, the tone of the discussion is that we postpone
partition-pruning to execution-time and hence my perhaps misdirected
attempts to inject it into some executor machinery.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-01-20 01:59:39 Re: B-Tree support function number 3 (strxfrm() optimization)
Previous Message Alvaro Herrera 2015-01-20 01:33:31 Re: B-Tree support function number 3 (strxfrm() optimization)