Re: hyrax vs. RelationBuildPartitionDesc

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: hyrax vs. RelationBuildPartitionDesc
Date: 2019-04-16 18:11:44
Message-ID: 2525.1555438304@sss.pgh.pa.us
Lists: pgsql-hackers

Amit Langote <amitlangote09(at)gmail(dot)com> writes:
>> I get that we want to get rid of the keep_* kludge in the long term, but
>> is it wrong to think, for example, that having keep_partdesc today allows
>> us to keep the pointer to rd_partdesc as long as we're holding the
>> relation open or a refcount on the whole relation, such as with the
>> PartitionDirectory mechanism?

Well, it's safe from the caller's standpoint as long as a suitable lock is
being held, which is neither well-defined nor enforced in any way :-(

> Ah, we're also trying to fix the memory leak caused by the current
> design of PartitionDirectory. AIUI, the design assumes that the leak
> would occur in fairly rare cases, but maybe not so? If partitions are
> frequently attached/detached concurrently (which may not be too uncommon
> if reduced lock levels encourage it), causing the PartitionDesc of
> a given relation to change all the time, then a planning session that's
> holding the PartitionDirectory containing that relation would leak as
> many PartitionDescs as there were concurrent changes, I guess.

We should get a relcache inval after a partdesc change, but the problem
with the current code is that the inval will only result in freeing the old
partdesc if it is processed while the relcache entry has
refcount zero. Otherwise the old rd_pdcxt is just shoved onto the
context chain, where it could survive indefinitely.
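
For reference, the sequence of events is roughly this (paraphrasing
relcache.c and partdesc.c; a sketch, not the exact code):

    /* RelationFlushRelation(), simplified: an inval destroys the entry
     * outright only if nobody has it pinned; otherwise we rebuild it
     * in place. */
    rebuild = !RelationHasReferenceCountZero(relation);
    RelationClearRelation(relation, rebuild);

    /* RelationBuildPartitionDesc(), during such a rebuild: the old
     * rd_pdcxt may still be referenced by whoever holds the pin, so
     * rather than deleting it we reparent it under the new context,
     * where it now lives for as long as the new one does. */
    if (rel->rd_pdcxt != NULL)
        MemoryContextSetParent(rel->rd_pdcxt, new_pdcxt);
    rel->rd_pdcxt = new_pdcxt;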

I'm not sure that this is really a huge problem in practice. The example
I gave upthread shows that a partdesc-changing transaction's own internal
invals do arrive during CommandCounterIncrement calls that occur while the
relcache pin is held; but it seems a bit artificial to assume that one
transaction would do a huge number of such changes. (Although, hm, maybe
a single-transaction pg_restore run could have an issue.) Once out of
the transaction, it's okay because we'll again invalidate the entry
at the start of the next transaction, and then the refcount will be zero
and we'll clean up. For other sessions it'd only happen if they saw the
inval while already holding a pin on the partitioned table, which probably
requires some unlucky timing; and that'd have to happen repeatedly to have
a leak that amounts to anything.

Still, though, I'm unhappy with the code as it stands. It's risky to
assume that it has no unpleasant behaviors that we haven't spotted yet
but will manifest after v12 is in the field. And I do not think that
it represents a solid base to build on. (As an example, if we made
any effort to get rid of the redundant extra inval events that occur
post-transaction, we'd suddenly have a much worse problem here.)
I'd rather go over to the copy-based solution for now, which *is*
semantically sound, and accept that we still have more performance
work to do. It's not like v12 isn't going to be light-years ahead of
v11 in this area anyway.
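
To be concrete about what "copy-based" means here: PartitionDirectoryLookup
would hand back a descriptor copied into the directory's own memory context,
rather than a pointer into the relcache entry's rd_pdcxt, so its lifetime is
tied to the directory rather than to invalidation timing. Very roughly (this
is a sketch, not the actual patch; copy_partition_desc is a made-up name):

    /* Deep-copy a PartitionDesc into the given context.  Illustrative
     * only; the real patch may differ in names and in which fields the
     * struct carries. */
    static PartitionDesc
    copy_partition_desc(PartitionDesc src, PartitionKey key, MemoryContext mcxt)
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(mcxt);
        PartitionDesc dst = palloc0(sizeof(PartitionDescData));

        dst->nparts = src->nparts;
        dst->oids = (Oid *) palloc(sizeof(Oid) * src->nparts);
        memcpy(dst->oids, src->oids, sizeof(Oid) * src->nparts);
        /* boundinfo already has a deep-copy helper we can reuse */
        dst->boundinfo = partition_bounds_copy(src->boundinfo, key);
        /* ... plus any other fields the struct carries ... */

        MemoryContextSwitchTo(oldcxt);
        return dst;
    }

The cost, of course, is the copying itself, which is the remaining
performance work alluded to above.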

regards, tom lane
