Re: Partitioned tables and relfilenode

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partitioned tables and relfilenode
Date: 2017-03-16 10:03:29
Message-ID: b7f2d81d-9c9b-ba4e-5e00-edf626567256@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017/03/15 7:09, Robert Haas wrote:
> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> The previous proposal was for expand_inherited_rtentry to not create RT
>> entries and AppendRelInfo's for the non-leaf tables, but I think that
>> doesn't work, as I tried to explain above. We need RTEs because that
>> seems to be the only way currently for informing the executor of the
>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
>> RTEs for the latter planning steps to collect them in partitioned_rels
>> mentioned above. So with the latest patch, we do create the RT entry and
>> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is
>> a minimal one; only parent_relid and child_relid are valid. To make the
>> latter planning steps ignore these minimal AppendRelInfo's, every
>> AppendRelInfo is now marked with child_relkind. Only
>> set_append_rel_pathlist() and inheritance_planner() process them to
>> collect the child_relid into the partitioned_rels list to be stored in
>> AppendPath/MergeAppendPath and ModifyTablePath, respectively.
>
> I see your point, but I still think this kind of stinks. You've got
> all kinds of logic that is now conditional on child_is_partitioned,
> and that seems like a recipe for bugs and future maintenance
> difficulties. It would be much nicer if we could come up with a
> design that doesn't create the AppendRelInfo in the first place,
> because then all of that stuff could just work. Can we get away with
> creating an RTE for each partitioned table (other than the parent,
> perhaps; for that one it would be nice to use the inh-flagged RTE we
> already have) but NOT creating an AppendRelInfo to go with it? If
> we've got the list of RTIs for the new RTEs associated with the append
> path in some other form, can't we get by without also having an
> AppendRelInfo to hold onto that translation?

I think we'll need to store *somewhere* the mapping of which inh=false
partitioned table RTE is the child of which inh=true (IOW, parent)
partitioned table RTE. I've come to think that AppendRelInfos, although
contain extraneous information that won't be used, are better than
inventing something new altogether for time being. AppendRelInfos are
referred to a few times by query_planner() steps before we eventually get
to either set_append_rel_pathlist() or inheritance_planner(), so not
changing that approach seems less worrisome for now. So now if we both
create child RTEs and AppendRelInfos for the partitioned tables, we don't
need to change expand_inherited_rtentry() at all with this patch.
Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the
child partitioned table RTEs, because no path/plan need to be created. We
can do away with having to create RelOptInfos for child partitioned table
RTEs, which I found to be not that invasive.

We seem to agree that we need to carry the partitioned table RT indexes
over to the executor somehow so that they are locked properly during
execution. To that end, Append, MergeAppend, and ModifyTable (and the
corresponding path nodes) nodes will contain a list of those RT indexes in
a field called partitioned_rels. In the result relation case, the list
from ModifyTable is copied to a field in PlannerGlobal called
nonleafResultRelations (next to the old resultRelations) and further to a
field of the same name in PlannedStmt. I considered if it can be done
without a new field, but realized that the old resultRelations is tied
closely with how the EState manages result relations.

InitPlan() and AcquireExecutorLocks() lock the tables denoted by the RT
indexes in PlannedStmt.nonleafResultRelations using RowExclusiveLock, just
like they would those in PlannedStmt.resultRelations.

Since expand_inherited_rtentries() adds PlanRowMarks for partitioned
tables just like previously, InitPlan() and AcquireExecutorLocks() will
lock them with RowShareLock if needed.

Finally, ExecInitAppend() and ExecInitMergeAppend() must acquire
AccessShareLock on the partition tables, if not already locked by InitPlan().

> The comments in InitPlan() explain why locks on result relations are
> taken in that function directly rather than during the ExecInitNode
> pass over the tree; it's because we need to make sure we take the
> strongest lock on any given relation first. But the changes in
> ExecInitAppend and ExecInitMergeAppend are problematic in that regard;
> some AccessShareLocks may already have been taken, and now we're
> taking locks that in some case may be RowShareLock, which could cause
> a lock upgrade. Remember that there's no reason that you couldn't
> join a table to one of its own partitions, or something like that. I
> think you need to try to jigger things so that InitPlan() takes all
> locks stronger than AccessShareLock that are required anywhere in the
> query, and then other nodes can take anything at AccessShareLock that
> they need.

Thanks for the explanation. If the scheme in the new patch that I
described above sounds OK, I think it takes care of locking the
partitioned tables without the upgrade hazards.

> I think that eliding the Append node when there's only one child may
> be unsafe in the case where the child's attribute numbers are
> different from the parent's attribute numbers. I remember Tom making
> some comment about this when I was working on MergeAppend, although I
> no longer remember the specific details.

Append node elision does not occur in the one-child case. With the patch:

create table q (a int) partition by list (a);
explain select * from q;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.00 rows=0 width=4)
One-Time Filter: false
(2 rows)

create table q1 partition of q for values in (1);
explain select * from q;
QUERY PLAN
------------------------------------------------------------
Append (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on q1 (cost=0.00..35.50 rows=2550 width=4)
(2 rows)

Maybe that should be done, but this patch doesn't implement that.

Thanks,
Amit

Attachment Content-Type Size
0001-Avoid-creating-scan-nodes-for-partitioned-tables.patch text/x-diff 56.3 KB
0002-Do-not-allocate-storage-for-partitioned-tables.patch text/x-diff 3.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Page 2017-03-16 10:09:47 Re: pg_ls_waldir() & pg_ls_logdir()
Previous Message Thomas Munro 2017-03-16 09:54:59 Re: pg_ls_waldir() & pg_ls_logdir()