Re: Partitioned tables and relfilenode

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partitioned tables and relfilenode
Date: 2017-03-14 22:09:12
Message-ID: CA+Tgmoa4f+0E5sHnNuHZ2vCH2xQb94dwqUdN0NniXFFxm-W1LA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> The previous proposal was for expand_inherited_rtentry to not create RT
> entries and AppendRelInfo's for the non-leaf tables, but I think that
> doesn't work, as I tried to explain above. We need RTEs because that
> seems to be the only way currently for informing the executor of the
> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those
> RTEs for the latter planning steps to collect them in partitioned_rels
> mentioned above. So with the latest patch, we do create the RT entry and
> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is
> a minimal one; only parent_relid and child_relid are valid. To make the
> latter planning steps ignore these minimal AppendRelInfo's, every
> AppendRelInfo is now marked with child_relkind. Only
> set_append_rel_pathlist() and inheritance_planner() process them to
> collect the child_relid into the partitioned_rels list to be stored in
> AppendPath/MergeAppendPath and ModifyTablePath, respectively.

I see your point, but I still think this kind of stinks. You've got
all kinds of logic that is now conditional on child_is_partitioned,
and that seems like a recipe for bugs and future maintenance
difficulties. It would be much nicer if we could come up with a
design that doesn't create the AppendRelInfo in the first place,
because then all of that stuff could just work. Can we get away with
creating an RTE for each partitioned table (other than the parent,
perhaps; for that one it would be nice to use the inh-flagged RTE we
already have) but NOT creating an AppendRelInfo to go with it? If
we've got the list of RTIs for the new RTEs associated with the append
path in some other form, can't we get by without also having an
AppendRelInfo to hold onto that translation?

The comments in InitPlan() explain why locks on result relations are
taken in that function directly rather than during the ExecInitNode
pass over the tree; it's because we need to make sure we take the
strongest lock on any given relation first. But the changes in
ExecInitAppend and ExecInitMergeAppend are problematic in that regard;
some AccessShareLocks may already have been taken, and now we're
taking locks that in some case may be RowShareLock, which could cause
a lock upgrade. Remember that there's no reason that you couldn't
join a table to one of its own partitions, or something like that. I
think you need to try to jigger things so that InitPlan() takes all
locks stronger than AccessShareLock that are required anywhere in the
query, and then other nodes can take anything at AccessShareLock that
they need.

I think that eliding the Append node when there's only one child may
be unsafe in the case where the child's attribute numbers are
different from the parent's attribute numbers. I remember Tom making
some comment about this when I was working on MergeAppend, although I
no longer remember the specific details.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2017-03-14 22:10:46 Re: GUC for cleanup indexes threshold.
Previous Message Alvaro Herrera 2017-03-14 22:08:32 Re: multivariate statistics (v25)