Re: Declarative partitioning - another take

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Declarative partitioning - another take
Date: 2016-12-12 01:02:00
Message-ID: 22dd313b-d7fd-22b5-0787-654845c8f849@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 12/07/2016 07:20 PM, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>>> My bad. The fix I sent last night for one of the cache flush issues
>>> wasn't quite right. The attached seems to fix it.
>> Yes, fixed here too. Thanks.
>
> Thanks for the report - that was a good catch.
>
> I've committed 0001 - 0006 with that correction and a few other
> adjustments. There's plenty of room for improvement here, and almost
> certainly some straight-up bugs too, but I think we're at a point
> where it will be easier and less error-prone to commit follow on
> changes incrementally rather than by continuously re-reviewing a very
> large patch set for increasingly smaller changes.
>

I've been working on a review / testing of the partitioning patch, but
have been unable to submit it before the commit due to a lot of travel.
However, at least some of the points seem to be still valid, so let me
share it as an after-commit review. Most of the issues are fairly minor
(possibly even nitpicking).

review
------

1) src/include/catalog/pg_partitioned_table.h contains this bit:

* $PostgreSQL: pgsql/src/include/catalog/pg_partitioned_table.h $

2) I'm wondering whether having 'table' in the catalog name (and also in
the new relkind) is too limiting. I assume we'll have partitioned
indexes one day, for example - do we expect to use the same catalogs?

3) A comment within BeginCopy (in copy.c) says:

* Initialize state for CopyFrom tuple routing. Watch out for
* any foreign partitions.

But the code does not seem to be doing that (at least I don't see any
obvious checks for foreign partitions). Also, the comment should
probably at least mention why foreign partitions need extra care.

To nitpick, the 'pd' variable in that block seems rather pointless - we
can assign directly to cstate->partition_dispatch_info.

4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the
new name rather strange - all other similar functions start with "Get",
so I guess "GetOpClass()" would be better. But I wonder if the rename
was even necessary, considering that it still deals with index operator
classes (although now also in context of partition keys). If the rename
really is needed, isn't that a sign that the function does not belong to
indexcmds.c anymore?

5) Half the error messages use 'child table' while the other half uses
'partition'. I think we should be using 'partition' as child tables
really have little meaning outside inheritance (which is kinda hidden
behind the new partitioning stuff).

6) The psql tab-completion seems a bit broken, because it only offers
the partitions, not the parent table. Which is usually exactly the
opposite of what the user wants.

testing
-------

I've also done quite a bit of testing with different partition layouts
(single-level list/range partitioning, multi-level partitioning etc.),
with fairly large number (~10k) of partitions. The scripts I used are
available here: https://bitbucket.org/tvondra/partitioning-tests

1) There seems to be an issue when a partition is created and then
accessed within the same transaction, i.e. for example

BEGIN;
... create parent ...
... create partition ....
... insert into parent ...
COMMIT;

which simply fails with an error like this:

ERROR: no partition of relation "range_test_single" found for row
DETAIL: Failing row contains (99000, 99000).

Of course, the partition is there. And interestingly enough, this works
perfectly fine when executed without the explicit transaction, so I
assume it's some sort of cache invalidation mix-up.

2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
like this:

QUERY PLAN
-------------------------------------------------------------------
Finalize Aggregate (cost=124523.64..124523.65 rows=1 width=8)
-> Gather (cost=124523.53..124523.64 rows=1 width=8)
Workers Planned: 1
-> Partial Aggregate (cost=123523.53..123523.54 rows=1 ...)
-> Append (cost=0.00..108823.53 rows=5880001 width=0)
-> Parallel Seq Scan on parent ...
-> Parallel Seq Scan on partition_1 ...
-> Parallel Seq Scan on partition_2 ...
-> Parallel Seq Scan on partition_3 ...
-> Parallel Seq Scan on partition_4 ...
-> Parallel Seq Scan on partition_5 ...
-> Parallel Seq Scan on partition_6 ...
... and the rest of the 10k partitions

So if I understand the plan correctly, we first do a parallel scan of
the parent, then partition_1, partition_2 etc. But AFAIK we scan the
tables in Append sequentially, and each partition only has 1000 rows
each, making the parallel execution rather inefficient. Unless we scan
the partitions concurrently.

In any case, as this happens even with plain inheritance, it's probably
more about the parallel query than about the new partitioning patch.

3) The last issue I noticed is that while

EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;

works just fine (it takes fair amount of time to plan with 10k
partitions, but that's expected), this

EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;

allocates a lot of memory (~8GB on my laptop, before it gets killed by
OOM killer). Again, the same thing happens with plain inheritance-based
partitioning, so it's probably not a bug in the partitioning patch.

I'm mentioning it here because I think the new partitioning will
hopefully get more efficient and handle large partition counts more
efficiently (the inheritance only really works for ~100 partitions,
which is probably why no one complained about OOM during UPDATEs). Of
course, 10k partitions is a bit extreme (good for testing, though).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-12-12 02:42:33 Re: snapbuild woes
Previous Message Jim Nasby 2016-12-11 18:52:57 Re: building HEAD on macos fails with #error no source of random numbers configured