Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-12-07 11:42:48
Message-ID: 78ae9888-7372-245d-cf6b-f655e6fe6c12@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 2016/12/07 13:38, Robert Haas wrote:
> On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
>> The latest patch I posted earlier today has this implementation.
>
> I decided to try out these patches today with #define
> CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of
> problems:
>
> 1. RelationClearRelation() wasn't preserving the rd_partkey, even
> though there's plenty of code that relies on it not changing while we
> hold a lock on the relation - in particular, transformPartitionBound.

Oh, I thought an AccessExclusiveLock on the relation would prevent having
to worry about that, but guess I'm wrong. Perhaps, having a lock on a
table does not preclude RelationClearRelation() resetting the table's
relcache.

> 2. partition_bounds_equal() was using the comparator and collation for
> partitioning column 0 to compare the datums for all partitioning
> columns. It's amazing this passed the regression tests.

Oops, it seems that the regression tests where the above code might be
exercised consisted only of range partition key with columns all of the
same type: create table test(a int, b int) partition by range (a, (a+b));

> The attached incremental patch fixes those things and some cosmetic
> issues I found along the way.

Thanks for fixing these. Attached patches include these changes.

> 3. RelationGetPartitionDispatchInfo() is badly broken:
>
> 1010 pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
> 1011 pd[i]->relid = RelationGetRelid(partrel);
> 1012 pd[i]->key = RelationGetPartitionKey(partrel);
> 1013 pd[i]->keystate = NIL;
> 1014 pd[i]->partdesc = partdesc;
> 1015 pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
> 1016 heap_close(partrel, NoLock);
> 1017
> 1018 m = 0;
> 1019 for (j = 0; j < partdesc->nparts; j++)
> 1020 {
> 1021 Oid partrelid = partdesc->oids[j];
>
> This code imagines that pointers it extracted from partrel are certain
> to remain valid after heap_close(partrel, NoLock), perhaps on the
> strength of the fact that we still retain a lock on the relation. But
> this isn't the case. As soon as nobody has the relation open, a call
> to RelationClearRelation() will destroy the relcache entry and
> everything to which it points; with CLOBBER_CACHE_ALWAYS, I see a
> failure at line 1021:
>
> #0 RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
> leaf_part_oids=0x7fff5633b938) at partition.c:1021
> 1021 Oid partrelid = partdesc->oids[j];
> (gdb) bt 5
> #0 RelationGetPartitionDispatchInfo (rel=0x1136dddf8, lockmode=3,
> leaf_part_oids=0x7fff5633b938) at partition.c:1021
> #1 0x0000000109b8d71f in ExecInitModifyTable (node=0x7fd12984d750,
> estate=0x7fd12b885438, eflags=0) at nodeModifyTable.c:1730
> #2 0x0000000109b5e7ac in ExecInitNode (node=0x7fd12984d750,
> estate=0x7fd12b885438, eflags=0) at execProcnode.c:159
> #3 0x0000000109b58548 in InitPlan (queryDesc=0x7fd12b87b638,
> eflags=0) at execMain.c:961
> #4 0x0000000109b57dcd in standard_ExecutorStart
> (queryDesc=0x7fd12b87b638, eflags=0) at execMain.c:239
> (More stack frames follow...)
> Current language: auto; currently minimal
> (gdb) p debug_query_string
> $1 = 0x7fd12b84c238 "insert into list_parted values (null, 1);"
> (gdb) p partdesc[0]
> $2 = {
> nparts = 2139062143,
> oids = 0x7f7f7f7f7f7f7f7f,
> boundinfo = 0x7f7f7f7f7f7f7f7f
> }
>
> As you can see, the partdesc is no longer valid here. I'm not
> immediately sure how to fix this; this isn't a simple thinko. You
> need to keep the relations open for the whole duration of the query,
> not just long enough to build the dispatch info. I think you should
> try to revise this so that each relation is opened once and kept open;
> maybe the first loop should be making a pointer-list of Relations
> rather than an int-list of relation OIDs.

Thanks for the explanation, I see the problem. I changed
PartitionDispatchData such that its 1st field is now Relation (instead of
current Oid), where we keep the relation descriptor of the corresponding
partitioned table. Currently, only the leaf relations are held open in
their respective ResultRelInfo's (ModifyTableState.mt_partitions) and
later closed in ExecEndModifyTable(). Similarly, we have the
ModifyTableState.mt_partition_dispatch_info array, each of whose members
holds open a partitioned relation using the new field. We close that too
along with leaf partitions as just mentioned. Similarly in case of COPY FROM.

> And it seems to me (though
> I'm getting a little fuzzy here because it's late) that you need all
> of the partitions open, not just the ones that are subpartitioned,
> because how else are you going to know how to remap the tuple if the
> column order is different? But I don't see this code doing that,
> which makes me wonder if the partitions are being opened yet again in
> some other location.

Actually leaf partitions are opened by the respective callers of
RelationGetPartitionDispatchInfo() and held open in the corresponding
ResultRelInfo's (ModifyTableState.mt_partitions). As mentioned above,
they will be closed by either ExecEndModifyTable() or CopyFrom() after
finishing the INSERT or COPY, respectively.

> I recommend that once you fix this, you run 'make check' with #define
> CLOBBER_CACHE_ALWAYS 1 and look for other hazards. Such mistakes are
> easy to make with this kind of patch.

With the attached latest version of the patches, I couldn't see any
failures with a CLOBBER_CACHE_ALWAYS build.

Thanks,
Amit

Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-20.patch text/x-diff 120.7 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch text/x-diff 24.0 KB
0003-Catalog-and-DDL-for-partitions-20.patch text/x-diff 212.0 KB
0004-psql-and-pg_dump-support-for-partitions-20.patch text/x-diff 22.1 KB
0005-Teach-a-few-places-to-use-partition-check-quals-20.patch text/x-diff 30.7 KB
0006-Tuple-routing-for-partitioned-tables-20.patch text/x-diff 38.0 KB
0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch text/x-diff 24.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-12-07 12:11:02 Re: Push down more full joins in postgres_fdw
Previous Message Etsuro Fujita 2016-12-07 11:23:41 Re: Push down more full joins in postgres_fdw