Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Dmitry Ivanov <d(dot)ivanov(at)postgrespro(dot)ru>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Erik Rijkers <er(at)xs4all(dot)nl>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Declarative partitioning - another take
Date: 2016-12-26 10:46:41
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry about the delay in replying.

On 2016/12/23 8:08, Robert Haas wrote:
> On Thu, Dec 22, 2016 at 3:35 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> While working on that, I discovered yet-another-bug having to do with the
>> tuple descriptor that's used as we route a tuple down a partition tree. If
>> attnums of given key attribute(s) are different on different levels, it
>> would be incorrect to use the original slot's (one passed by ExecInsert())
>> tuple descriptor to inspect the original slot's heap tuple, as we go down
>> the tree. It might cause spurious "partition not found" at some level due
>> to looking at incorrect field in the input tuple because of using the
>> wrong tuple descriptor (root table's attnums not always same as other
>> partitioned tables in the tree). Patch 0001 fixes that including a test.
> I committed this, but I'm a bit uncomfortable with it: should the
> TupleTableSlot be part of the ModifyTableState rather than the EState?

Done that way in 0001 of the attached patches. So, instead of making the
standalone partition_tuple_slot a field of EState (with the actual
TupleTableSlot in its tupleTable), it is now allocated within
ModifyTableState and CopyState, and released when ModifyTable node or
CopyFrom finishes, respectively.

>> It also addresses the problem I mentioned previously that once
>> tuple-routing is done, we failed to switch to a slot with the leaf
>> partition's tupdesc (IOW, continued to use the original slot with root
>> table's tupdesc causing spurious failures due to differences in attums
>> between the leaf partition and the root table).
>> Further patches 0002, 0003 and 0004 fix bugs that I sent one-big-patch for
>> in my previous message. Each patch has a test for the bug it's meant to fix.
> Regarding 0002, I think that this is kind of a strange fix. Wouldn't
> it be better to get hold of the original tuple instead of reversing
> the conversion? And what of the idea of avoiding the conversion in
> the (probably very common) case where we can?

To get hold of the original tuple, how about adding an argument orig_slot
to ExecConstraints()? I've implemented that approach in the new 0002.

Regarding the possibility of avoiding the conversion in very common cases,
I think that could be done considering the following: If the mapping from
the attribute numbers of the parent table to that of a child table is an
identity map, we don't need to convert tuples. Currently however,
convert_tuples_by_name() also requires tdtypeid of the input and output
TupleDescs to be equal. The reason cited for that is that we may fail to
"inject the right OID into the tuple datum" if the types don't match. In
case of partitioning, hasoid status must match between the parent and its
partitions at all times, so the aforementioned condition is satisfied
without requiring that tdtypeid are same. And oid column (if present) is
always located at a given position in HeapTuple, so need not map that.

Based on the above argument, patch 0006 teaches convert_tuples_by_name()
to *optionally* not require tdtypeid of input and output tuple descriptors
to be equal. It's implemented by introducing a new argument to
convert_tuples_by_name() named 'consider_typeid'. We pass 'false' only
for the partitioning cases.

(Perhaps, the following should be its own new thread)

I noticed that ExecProcessReturning() doesn't work properly after tuple
routing (example shows how returning tableoid currently fails but I
mention some other issues below):

create table p (a int, b int) partition by range (a);
create table p1 partition of p for values from (1) to (10);
insert into p values (1) returning tableoid::regclass, *;
tableoid | a | b
- | 1 |
(1 row)


I tried to fix that in 0007 to get:

insert into p values (1) returning tableoid::regclass, *;
tableoid | a | b
p | 1 |
(1 row)


But I think it *may* be wrong to return the root table OID for tuples
inserted into leaf partitions, because with select we get partition OIDs:

select tableoid::regclass, * from p;
tableoid | a | b
p1 | 1 |
(1 row)

If so, that means we should build the projection info (corresponding to
the returning list) for each target partition somehow. ISTM, that's going
to have to be done within the planner by appropriate inheritance
translation of the original returning targetlist.


Attachment Content-Type Size
0001-Allocate-partition_tuple_slot-in-respective-nodes.patch text/x-diff 9.2 KB
0002-Make-ExecConstraints-show-the-correct-row-in-error-m.patch text/x-diff 13.7 KB
0003-Fix-a-bug-in-how-we-generate-partition-constraints.patch text/x-diff 14.2 KB
0004-Fix-a-bug-of-insertion-into-an-internal-partition.patch text/x-diff 6.7 KB
0005-Add-some-more-tests-for-tuple-routing.patch text/x-diff 5.2 KB
0006-Avoid-tuple-coversion-in-common-partitioning-cases.patch text/x-diff 5.0 KB
0007-Fix-RETURNING-to-work-correctly-after-tuple-routing.patch text/x-diff 2.6 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-12-26 12:08:25 Re: proposal: session server side variables
Previous Message Amit Langote 2016-12-26 10:11:42 Re: ALTER TABLE parent SET WITHOUT OIDS and the oid column