Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Declarative partitioning - another take
Date: 2016-12-12 06:37:17
Message-ID: 99afbfc0-1cff-fbc8-6361-be8d1d7c6fb2@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers


Hi Tomas,

On 2016/12/12 10:02, Tomas Vondra wrote:
> On 12/07/2016 07:20 PM, Robert Haas wrote:
>> I've committed 0001 - 0006 with that correction and a few other
>> adjustments. There's plenty of room for improvement here, and almost
>> certainly some straight-up bugs too, but I think we're at a point
>> where it will be easier and less error-prone to commit follow on
>> changes incrementally rather than by continuously re-reviewing a very
>> large patch set for increasingly smaller changes.
>>
>
> I've been working on a review / testing of the partitioning patch, but
> have been unable to submit it before the commit due to a lot of travel.
> However, at least some of the points seem to be still valid, so let me
> share it as an after-commit review. Most of the issues are fairly minor
> (possibly even nitpicking).

Thanks a lot for the review comments and testing!

I attach a patch that implements some of the fixes that you suggest below.
Also, I am merging a patch I sent earlier [1] to avoid having to look at
multiple patches.

> review
> ------
>
> 1) src/include/catalog/pg_partitioned_table.h contains this bit:
>
> * $PostgreSQL: pgsql/src/include/catalog/pg_partitioned_table.h $

Fixed.

> 2) I'm wondering whether having 'table' in the catalog name (and also in
> the new relkind) is too limiting. I assume we'll have partitioned indexes
> one day, for example - do we expect to use the same catalogs?

I am not sure I understand your idea of partitioned indexes, but I doubt
it would require entries in the catalog under consideration. Could you
perhaps elaborate more?

> 3) A comment within BeginCopy (in copy.c) says:
>
> * Initialize state for CopyFrom tuple routing. Watch out for
> * any foreign partitions.
>
> But the code does not seem to be doing that (at least I don't see any
> obvious checks for foreign partitions). Also, the comment should probably
> at least mention why foreign partitions need extra care.

It's a stale comment after I took out the code that handled foreign
partitions in final versions of the tuple-routing patch. You may have
noticed in the commit message that tuple-routing does not work for foreign
partitions, because the changes I'd proposed weren't that good. Comment
fixed anyway.

> To nitpick, the 'pd' variable in that block seems rather pointless - we
> can assign directly to cstate->partition_dispatch_info.

OK, fixed. Also, fixed similar code in ExecInitModifyTable().

> 4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the new
> name rather strange - all other similar functions start with "Get", so I
> guess "GetOpClass()" would be better. But I wonder if the rename was even
> necessary, considering that it still deals with index operator classes
> (although now also in context of partition keys). If the rename really is
> needed, isn't that a sign that the function does not belong to indexcmds.c
> anymore?

The fact that both index keys and partition keys have very similar
specification syntax means there would be some common code handling the
same, of which this is one (maybe, only) example. I didn't see much point
in finding a new place for the function, although I can see why it may be
a bit confusing to future readers of the code.

About the new name - seeing the top line in the old header comment, what
the function does is resolve a possibly explicitly specified operator
class name to an operator class OID. I hadn't changed the name in my
original patch, but Robert suggested the rename, which I thought made sense.

> 5) Half the error messages use 'child table' while the other half uses
> 'partition'. I think we should be using 'partition' as child tables really
> have little meaning outside inheritance (which is kinda hidden behind the
> new partitioning stuff).

One way to go about it may be to check all sites that can possibly report
an error involving child tables (aka "partitions") whether they know from
the context which name to use. I think it's possible, because we have
access to the parent relation in all such sites and looking at the relkind
can tell whether to call child tables "partitions".

> 6) The psql tab-completion seems a bit broken, because it only offers the
> partitions, not the parent table. Which is usually exactly the opposite of
> what the user wants.

Actually, I have not implemented any support for tab-completion for the
new DDL. I will work on that.

> testing
> -------
>
> I've also done quite a bit of testing with different partition layouts
> (single-level list/range partitioning, multi-level partitioning etc.),
> with fairly large number (~10k) of partitions. The scripts I used are
> available here: https://bitbucket.org/tvondra/partitioning-tests
>
> 1) There seems to be an issue when a partition is created and then
> accessed within the same transaction, i.e. for example
>
> BEGIN;
> ... create parent ...
> ... create partition ....
> ... insert into parent ...
> COMMIT;
>
> which simply fails with an error like this:
>
> ERROR: no partition of relation "range_test_single" found for row
> DETAIL: Failing row contains (99000, 99000).
>
> Of course, the partition is there. And interestingly enough, this works
> perfectly fine when executed without the explicit transaction, so I assume
> it's some sort of cache invalidation mix-up.

That's a clearly bug. A relcache invalidation on the parent was missing
in create table ... partition of code path. Fixed.

> 2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
> like this:
>
> QUERY PLAN
> -------------------------------------------------------------------
> Finalize Aggregate (cost=124523.64..124523.65 rows=1 width=8)
> -> Gather (cost=124523.53..124523.64 rows=1 width=8)
> Workers Planned: 1
> -> Partial Aggregate (cost=123523.53..123523.54 rows=1 ...)
> -> Append (cost=0.00..108823.53 rows=5880001 width=0)
> -> Parallel Seq Scan on parent ...
> -> Parallel Seq Scan on partition_1 ...
> -> Parallel Seq Scan on partition_2 ...
> -> Parallel Seq Scan on partition_3 ...
> -> Parallel Seq Scan on partition_4 ...
> -> Parallel Seq Scan on partition_5 ...
> -> Parallel Seq Scan on partition_6 ...
> ... and the rest of the 10k partitions
>
> So if I understand the plan correctly, we first do a parallel scan of the
> parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
> Append sequentially, and each partition only has 1000 rows each, making
> the parallel execution rather inefficient. Unless we scan the partitions
> concurrently.
>
> In any case, as this happens even with plain inheritance, it's probably
> more about the parallel query than about the new partitioning patch.

Yes, I have seen some discussion [2] about a Parallel Append, which would
result in plans you're probably thinking of.

> 3) The last issue I noticed is that while
>
> EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;
>
> works just fine (it takes fair amount of time to plan with 10k partitions,
> but that's expected), this
>
> EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;
>
> allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
> killer). Again, the same thing happens with plain inheritance-based
> partitioning, so it's probably not a bug in the partitioning patch.
>
> I'm mentioning it here because I think the new partitioning will hopefully
> get more efficient and handle large partition counts more efficiently (the
> inheritance only really works for ~100 partitions, which is probably why
> no one complained about OOM during UPDATEs). Of course, 10k partitions is
> a bit extreme (good for testing, though).

Plans with the inheritance parents as target tables (update/delete) go
through inheritance_planner() in the optimizer, which currently has a
design that is based on certain assumptions about traditional inheritance.
It's possible hopefully to make it less expensive for the partitioned tables.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/c9737515-36ba-6a42-f788-1b8867e2dd38%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/83755E07-C435-493F-9F93-F727604D66A1%40gmail.com

Attachment Content-Type Size
misc-code-fixes-1.patch text/x-diff 8.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2016-12-12 06:40:24 Parallel Index-only scan
Previous Message Heikki Linnakangas 2016-12-12 06:34:27 Re: Broken SSL tests in master