Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Declarative partitioning - another take
Date: 2016-12-13 08:45:41
Message-ID: 247cb710-31ec-5f11-a6f0-e7e1ecdd6dbf@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016/12/13 0:17, Tomas Vondra wrote:
> On 12/12/2016 07:37 AM, Amit Langote wrote:
>>
>> Hi Tomas,
>>
>> On 2016/12/12 10:02, Tomas Vondra wrote:
>>>
>>> 2) I'm wondering whether having 'table' in the catalog name (and also in
>>> the new relkind) is too limiting. I assume we'll have partitioned indexes
>>> one day, for example - do we expect to use the same catalogs?
>>
>> I am not sure I understand your idea of partitioned indexes, but I doubt
>> it would require entries in the catalog under consideration. Could you
>> perhaps elaborate more?
>>
>
> OK, let me elaborate. Let's say we have a partitioned table, and I want to
> create an index. The index may be either "global" i.e. creating a single
> relation for data from all the partitions, or "local" (i.e. partitioned
> the same way as the table).
>
> Local indexes are easier to implement (it's essentially what we have now,
> except that we need to create the indexes manually for each partition),
> and don't work particularly well for some use cases (e.g. unique
> constraints). This is what I mean by "partitioned indexes".
>
> If the index is partitioned just like the table, we probably won't need to
> copy the partition key info (so, nothing in pg_partitioned_table).
> I'm not sure it makes sense to partition the index differently than the
> table - I don't see a case where that would be useful.
>
> The global indexes would work better for the unique constraint use case,
> but it clearly contradicts our idea of TID (no information about which
> partition that references).
>
> So maybe the catalog really only needs to track info about tables? Not
> sure. I'm just saying it'd be unfortunate to have _table in the name, and
> end up using it for indexes too.

Hmm, I didn't quite think of the case where the index is partitioned
differently from the table, but perhaps that's possible with some other
databases.

What you describe as "local indexes" or "locally partitioned indexes" is
something I would like to see being pursued in the near term. In that
case, we would allow defining indexes on the parent that are recursively
defined on the partitions and marked as inherited index, just like we have
inherited check constraints and NOT NULL constraints. I have not studied
whether we could implement (globally) *unique* indexes with this scheme
though, wherein the index key is a superset of the partition key.

>>> 5) Half the error messages use 'child table' while the other half uses
>>> 'partition'. I think we should be using 'partition' as child tables really
>>> have little meaning outside inheritance (which is kinda hidden behind the
>>> new partitioning stuff).
>>
>> One way to go about it may be to check all sites that can possibly report
>> an error involving child tables (aka "partitions") whether they know from
>> the context which name to use. I think it's possible, because we have
>> access to the parent relation in all such sites and looking at the relkind
>> can tell whether to call child tables "partitions".
>>
>
> Clearly, this is a consequence of building the partitioning on top of
> inheritance (not objecting to that approach, merely stating a fact).
>
> I'm fine with whatever makes the error messages more consistent, if it
> does not make the code significantly more complex. It's a bit confusing
> when some use 'child tables' and others 'partitions'. I suspect even a
> single DML command may return a mix of those, depending on where exactly
> it fails (old vs. new code).

So, we have mostly some old DDL (CREATE/ALTER TABLE) and maintenance
commands that understand inheritance. All of the their error messages
apply to partitions as well, wherein they will be referred to as "child
tables" using old terms. We now have some cases where the commands cause
additional error messages for only partitions because of additional
restrictions that apply to them. We use "partitions" for them because
they are essentially new error messages.

There won't be a case where single DML command would mix the two terms,
because we do not allow mixing partitioning and regular inheritance.
Maybe I misunderstood you though.

>>> So if I understand the plan correctly, we first do a parallel scan of the
>>> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
>>> Append sequentially, and each partition only has 1000 rows each, making
>>> the parallel execution rather inefficient. Unless we scan the partitions
>>> concurrently.
>>>
>>> In any case, as this happens even with plain inheritance, it's probably
>>> more about the parallel query than about the new partitioning patch.
>>
>> Yes, I have seen some discussion [2] about a Parallel Append, which would
>> result in plans you're probably thinking of.
>>
>
> Actually, I think that thread is about allowing partial paths even if only
> some appendrel members support it. My point is that in this case it's a
> bit silly to even build the partial paths, when the partitions only have
> 1000 rows each.
>
> I kinda suspect we only look at the total appendrel rowcount estimate, but
> haven't checked.

Oh, I misread. Anyway, Parallel Append is something to think about.

>>> 3) The last issue I noticed is that while
>>>
>>> EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
>>>
>>> works just fine (it takes fair amount of time to plan with 10k partitions,
>>> but that's expected), this
>>>
>>> EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
>>>
>>> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
>>> killer). Again, the same thing happens with plain inheritance-based
>>> partitioning, so it's probably not a bug in the partitioning patch.
>>>
>>> I'm mentioning it here because I think the new partitioning will hopefully
>>> get more efficient and handle large partition counts more efficiently (the
>>> inheritance only really works for ~100 partitions, which is probably why
>>> no one complained about OOM during UPDATEs). Of course, 10k partitions is
>>> a bit extreme (good for testing, though).
>>
>> Plans with the inheritance parents as target tables (update/delete)
>> go through inheritance_planner() in the optimizer, which currently
>> has a design that is based on certain assumptions about traditional
>> inheritance. It's possible hopefully to make it less expensive for
>> the partitioned tables.
>>
>
> Yes, I know. I wasn't really reporting it as a bug in the partitioning
> patch, but more as a rather surprising difference between plain SELECT and
> UPDATE.
>
> Am I right that one of the ambitions of the new partitioning is to improve
> behavior with large number of partitions?

Yes. Currently, SELECT planning is O(n) with significantly large constant
factor. It is possible now to make it O(log n). Also, if we can do away
with inheritance_planner() treatment for the *partitioned tables* in case
of UPDATE/DELETE, then that would be great. That would mean their
planning time would be almost same as the SELECT case.

As you might know, we have volunteers to make this happen sooner [1], :)

> At first I thought it's somewhat related to the FDW sharding (each node
> being a partition and having local subpartitions), but I realize the
> planner will only deal with the node partitions I guess.

Yeah, planner would only have the local partitioning metadata at its
disposal. Foreign tables can only be leaf partitions, which if need to be
scanned for a given query, will be scanned using a ForeignScan.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2016-12-13 10:06:00 Re: Typmod associated with multi-row VALUES constructs
Previous Message Andrew Borodin 2016-12-13 08:35:12 Re: pg_background contrib module proposal