v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Amit Langote <amitlangote09(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>
Subject: v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
Date: 2020-02-23 22:01:43
Message-ID: 20200223220143.GA31506@telsasoft.com
Lists: pgsql-hackers

The wiki entry below links to a long thread, from which I've tried to quote
some of the most important mails below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix

I wondered whether there's an effort to pursue a resolution for v13?

On Fri, Apr 12, 2019 at 11:42:24AM -0400, Tom Lane wrote in <31027(dot)1555083744(at)sss(dot)pgh(dot)pa(dot)us>:
> Michael Paquier <michael(at)paquier(dot)xyz> writes:
> > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote:
> >> The problem lies in all branches that have partitioning, so it should be
> >> listed under Older Bugs, right? You may have noticed that I posted
> >> patches for all branches down to 10.
>
> > I have noticed. The message from Tom upthread outlined that an open
> > item should be added, but this is not one. That's what I wanted to
> > emphasize. Thanks for making it clearer.
>
> To clarify a bit: there's more than one problem here. Amit added an
> open item about pre-existing leaks related to rd_partcheck. (I'm going
> to review and hopefully push his fix for that today.) However, there's
> a completely separate leak associated with mismanagement of rd_pdcxt,
> as I showed in [1], and it doesn't seem like we have consensus about
> what to do about that one. AFAIK that's a new bug in 12 (caused by
> 898e5e329) and so it ought to be in the main open-items list.
>
> This thread has discussed a bunch of possible future changes like
> trying to replace copying of relcache-provided data structures
> with reference-counting, but I don't think any such thing could be
> v12 material at this point. We do need to fix the newly added
> leak though.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us
>
>

On Fri, Mar 15, 2019 at 05:41:47PM -0400, Robert Haas wrote in <CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw(at)mail(dot)gmail(dot)com>:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > More to the point, we turned *one* rebuild = false situation into
> > a bunch of rebuild = true situations. I haven't studied it closely,
> > but I think a CCA animal would probably see O(N^2) rebuild = true
> > invocations in a query with N partitions, since each time
> > expand_partitioned_rtentry opens another child table, we'll incur
> > an AcceptInvalidationMessages call which leads to forced rebuilds
> > of the previously-pinned rels. In non-CCA situations, almost always
> > nothing happens with the previously-examined relcache entries.
>
> That's rather unfortunate. I start to think that clobbering the cache
> "always" is too hard a line.
>
> > I agree that copying data isn't great. What I don't agree is that this
> > solution is better. In particular, copying data out of the relcache
> > rather than expecting the relcache to hold still over long periods
> > is the way we've done things everywhere else, cf RelationGetIndexList,
> > RelationGetStatExtList, RelationGetIndexExpressions,
> > RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
> > RelationGetExclusionInfo, GetRelationPublicationActions. I don't care
> > for a patch randomly deciding to do things differently on the basis of an
> > unsupported-by-evidence argument that it might cost too much to copy the
> > data. If we're going to make a push to reduce the amount of copying of
> > that sort that we do, it should be a separately (and carefully) designed
> > thing that applies to all the relcache substructures that have the issue,
> > not one-off hacks that haven't been reviewed thoroughly.
>
> That's not an unreasonable argument. On the other hand, if you never
> try new stuff, you lose opportunities to explore things that might
> turn out to be better and worth adopting more widely.
>
> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex. You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."
>
> There is another difference as well: there's a good chance that
> somebody is going to want to mutate a Bitmapset, whereas they had
> BETTER NOT think that they can mutate the PartitionDesc. So returning
> an uncopied Bitmapset is kinda risky in a way that returning an
> uncopied PartitionDesc is not.
>
> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning. (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't? The only alternative seems to be
> to have each subsystem hold its own reference count, as I did with the
> PartitionDirectory stuff, which is not something we'd want to scale
> out.
>
> > I especially don't care for the much-less-than-half-baked kluge of
> > chaining the old rd_pdcxt onto the new one and hoping that it will go away
> > at a suitable time.
>
> It will go away at a suitable time, but maybe not at the soonest
> suitable time. It wouldn't be hard to improve that, though. The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there. If there already is one there, parent
> the new one under it. When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.
> With that change, things get destroyed at the earliest time at which
> we know the old things aren't referenced, instead of the earliest time
> at which they are not referenced + an invalidation arrives.
>
> > regression=# create table parent (a text, b int) partition by list (a);
> > CREATE TABLE
> > regression=# create table child (a text, b int);
> > CREATE TABLE
> > regression=# do $$
> > regression$# begin
> > regression$# for i in 1..10000000 loop
> > regression$# alter table parent attach partition child for values in ('x');
> > regression$# alter table parent detach partition child;
> > regression$# end loop;
> > regression$# end $$;
>
> Interesting example.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
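
To make the "each subsystem holds its own reference count" idea above concrete,
here is a minimal standalone C sketch of that pattern. It is not the actual
PartitionDirectory code; every name in it (DemoDirectory, demo_directory_pin,
and so on) is invented for illustration, and the real thing operates on
relcache entries and memory contexts rather than malloc'd structs.

    /*
     * Minimal sketch, not PostgreSQL code: a planning-lifetime "directory"
     * that pins every cache entry it looks at and releases all pins when it
     * is destroyed.  All names here are hypothetical.
     */
    #include <stdlib.h>

    typedef struct DemoRel
    {
        int  refcnt;            /* entry must survive while refcnt > 0 */
        /* ... cached metadata such as a partition descriptor ... */
    } DemoRel;

    typedef struct DemoDirectory
    {
        int       nrels;
        int       maxrels;
        DemoRel **rels;         /* everything pinned for the directory's life */
    } DemoDirectory;

    DemoDirectory *
    demo_directory_create(void)
    {
        DemoDirectory *dir = malloc(sizeof(DemoDirectory));

        dir->nrels = 0;
        dir->maxrels = 8;
        dir->rels = malloc(dir->maxrels * sizeof(DemoRel *));
        return dir;
    }

    /* Pin a relation's cache entry so it can't be freed while we plan. */
    void
    demo_directory_pin(DemoDirectory *dir, DemoRel *rel)
    {
        if (dir->nrels == dir->maxrels)
        {
            dir->maxrels *= 2;
            dir->rels = realloc(dir->rels, dir->maxrels * sizeof(DemoRel *));
        }
        rel->refcnt++;
        dir->rels[dir->nrels++] = rel;
    }

    /* End of planning: release every pin in one place. */
    void
    demo_directory_destroy(DemoDirectory *dir)
    {
        for (int i = 0; i < dir->nrels; i++)
            dir->rels[i]->refcnt--;
        free(dir->rels);
        free(dir);
    }

The point is just that destroying the directory at the end of planning
releases every pin in one place, which is how the cache could know when it's
safe to discard superseded data.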

On Sun, Apr 14, 2019 at 03:29:26PM -0400, Tom Lane wrote in <27380(dot)1555270166(at)sss(dot)pgh(dot)pa(dot)us>:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I agree that copying data isn't great. What I don't agree is that this
> >> solution is better.
>
> > I am not very convinced that it makes sense to lump something like
> > RelationGetIndexAttrBitmap in with something like
> > RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
> > single Bitmapset, whereas the data structure RelationGetPartitionDesc
> > is vastly larger and more complex. You can't say "well, if it's best
> > to copy 32 bytes of data out of the relcache every time we need it, it
> > must also be right to copy 10k or 100k of data out of the relcache
> > every time we need it."
>
> I did not say that. What I did say is that they're both correct
> solutions. Copying large data structures is clearly a potential
> performance problem, but that doesn't mean we can take correctness
> shortcuts.
>
> > If we want an at-least-somewhat unified solution here, I think we need
> > to bite the bullet and make the planner hold a reference to the
> > relcache throughout planning. (The executor already keeps it open, I
> > believe.) Otherwise, how is the relcache supposed to know when it can
> > throw stuff away and when it can't?
>
> The real problem with that is that we *still* won't know whether it's
> okay to throw stuff away or not. The issue with these subsidiary
> data structures is exactly that we're trying to reduce the lock strength
> required for changing one of them, and as soon as you make that lock
> strength less than AEL, you have the problem that that value may need
> to change in relcache entries despite them being pinned. The code I'm
> complaining about is trying to devise a shortcut solution to exactly
> that situation ... and it's not a good shortcut.
>
> > The only alternative seems to be to have each subsystem hold its own
> > reference count, as I did with the PartitionDirectory stuff, which is
> > not something we'd want to scale out.
>
> Well, we clearly don't want to devise a separate solution for every
> subsystem, but it doesn't seem out of the question to me that we could
> build a "reference counted cache sub-structure" mechanism and apply it
> to each of these relcache fields. Maybe it could be unified with the
> existing code in the typcache that does a similar thing. Sure, this'd
> be a fair amount of work, but we've done it before. Syscache entries
> didn't use to have reference counts, for example.
>
> BTW, the problem I have with the PartitionDirectory stuff is exactly
> that it *isn't* a reference-counted solution. If it were, we'd not
> have this problem of not knowing when we could free rd_pdcxt.
>
> >> I especially don't care for the much-less-than-half-baked kluge of
> >> chaining the old rd_pdcxt onto the new one and hoping that it will go away
> >> at a suitable time.
>
> > It will go away at a suitable time, but maybe not at the soonest
> > suitable time. It wouldn't be hard to improve that, though. The
> > easiest thing to do, I think, would be to have an rd_oldpdcxt and
> > stuff the old context there. If there already is one there, parent
> > the new one under it. When RelationDecrementReferenceCount reduces
> > the reference count to zero, destroy anything found in rd_oldpdcxt.
>
> Meh. While it seems likely that that would mask most practical problems,
> it still seems like covering up a wart with a dirty bandage. In
> particular, that fix doesn't fix anything unless relcache reference counts
> go to zero pretty quickly --- which I'll just note is directly contrary
> to your enthusiasm for holding relcache pins longer.
>
> I think that what we ought to do for v12 is have PartitionDirectory
> copy the data, and then in v13 work on creating real reference-count
> infrastructure that would allow eliminating the copy steps with full
> safety. The $64 question is whether that really would cause unacceptable
> performance problems. To look into that, I made the attached WIP patches.
> (These are functionally complete, but I didn't bother for instance with
> removing the hunk that 898e5e329 added to relcache.c, and the comments
> need work, etc.) The first one just changes the PartitionDirectory
> code to do that, and then the second one micro-optimizes
> partition_bounds_copy() to make it somewhat less expensive, mostly by
> collapsing lots of small palloc's into one big one.
>
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
> N         tps, HEAD       tps, patch
> 2        11426.243754    11448.615193
> 8        11254.833267    11374.278861
> 32       11288.329114    11371.942425
> 128      11222.329256    11185.845258
> 512      11001.177137    10572.917288
> 1024     10612.456470     9834.172965
> 4096      8819.110195     7021.864625
> 8192      7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
> N         tps, HEAD       tps, patch
> 2        11037.855338    11153.595860
> 8        11085.218022    11019.132341
> 32       10994.348207    10935.719951
> 128      10884.417324    10532.685237
> 512      10635.583411     9578.108915
> 1024     10407.286414     8689.585136
> 4096      8361.463829     5139.084405
> 8192      7075.880701     3442.542768
>
> Now certainly these numbers suggest that avoiding the copy could be worth
> our trouble, but these results are still several orders of magnitude
> better than where we were two weeks ago [2]. Plus, this is an extreme
> case that's not really representative of real-world usage, since the test
> tables have neither indexes nor any data. In practical situations the
> baseline would be lower and the dropoff less bad. So I don't feel bad
> about shipping v12 with these sorts of numbers and hoping for more
> improvement later.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
>
> [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24
>
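
As an aside on the partition_bounds_copy() micro-optimization mentioned above
(collapsing lots of small palloc's into one big one), here's a hedged
standalone sketch of that allocation pattern. It uses malloc and an invented
DemoBounds struct rather than the real PartitionBoundInfoData, so it only
illustrates the technique, not the actual patch:

    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for the real bounds structure; this layout is an assumption. */
    typedef struct DemoBounds
    {
        int    ndatums;
        long  *datums;          /* ndatums entries */
        int   *indexes;         /* ndatums entries */
    } DemoBounds;

    /*
     * Copy src with a single allocation: the header and both arrays are
     * carved out of one block, so building and freeing the copy touches the
     * allocator once instead of once per array.
     */
    DemoBounds *
    demo_bounds_copy(const DemoBounds *src)
    {
        size_t      hdr = sizeof(DemoBounds);
        size_t      dat = sizeof(long) * src->ndatums;
        size_t      idx = sizeof(int) * src->ndatums;
        char       *blob = malloc(hdr + dat + idx);
        DemoBounds *dst = (DemoBounds *) blob;

        dst->ndatums = src->ndatums;
        dst->datums = (long *) (blob + hdr);        /* 8-byte aligned */
        dst->indexes = (int *) (blob + hdr + dat);  /* follows the longs */
        memcpy(dst->datums, src->datums, dat);
        memcpy(dst->indexes, src->indexes, idx);
        return dst;             /* one free(dst) releases everything */
    }

In the backend the same shape presumably applies with a single palloc in the
destination memory context, so the copy is built and thrown away with one
allocator call each way.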

On Wed, May 01, 2019 at 01:09:07PM -0400, Robert Haas wrote in <CA+Tgmob-cska+-WUC7T-G4zkSJp7yum6M_bzYd4YFzwQ51qiow(at)mail(dot)gmail(dot)com>:
> On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The message I'm replying to is marked as an open item. Robert, what do
> > you think needs to be done here before release? Could you summarize,
> > so we then can see what others think?
>
> The original issue on this thread was that hyrax started running out
> of memory when it hadn't been doing so previously. That happened
> because, for complicated reasons, commit
> 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
> hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
> Commits 2455ab48844c90419714e27eafd235a85de23232 and
> d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.
>
> In the email at issue, Tom complains about two things. First, he
> complains about the fact that I try to arrange things so that relcache
> data remains valid for as long as required instead of just copying it.
> Second, he complains about the fact repeated ATTACH and DETACH
> PARTITION operations can produce a slow session-lifespan memory leak.
>
> I think it's reasonable to fix the second issue for v12. I am not
> sure how important it is, because (1) the leak is small, (2) it seems
> unlikely that anyone would execute enough ATTACH/DETACH PARTITION
> operations in one backend for the leak to amount to anything
> significant, and (3) if a relcache flush ever happens when you don't
> have the relation open, all of the "leaked" memory will be un-leaked.
> However, I believe that we could fix it as follows. First, invent
> rd_pdoldcxt and put the first old context there; if that pointer is
> already in use, then parent the new context under the old one.
> Second, in RelationDecrementReferenceCount, if the refcount hits 0,
> nuke rd_pdoldcxt and set the pointer back to NULL. With that change,
> you would only keep the old PartitionDesc around until the ref count
> hits 0, whereas at present, you keep the old PartitionDesc around
> until an invalidation happens while the ref count is 0.
>
> I think the first issue is not v12 material. Tom proposed to fix it
> by copying all the relevant data out of the relcache, but his own
> performance results show a pretty significant hit, and AFAICS he
> hasn't pointed to anything that's actually broken by the current
> approach. What I think should be done is actually generalize the
> approach I took in this patch, so that instead of the partition
> directory holding a reference count, the planner or executor holds
> one. Then not only could people who want information about the
> PartitionDesc avoid copying stuff from the relcache, but maybe other
> things as well. I think that would be significantly better than
> continuing to double-down on the current copy-everything approach,
> which really only works well in a world where a table can't have all
> that much metadata, which is clearly not true for PostgreSQL any more.
> I'm not sure that Tom is actually opposed to this approach -- although
> I may have misunderstood his position -- but where we disagree, I
> think, is that I see what I did in this commit as a stepping-stone
> toward a better world, and he sees it as something that should be
> killed with fire until that better world has fully arrived.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
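
To help visualize the rd_pdoldcxt proposal, here's a minimal standalone
mock-up of the idea. The real fix would chain MemoryContexts inside the
relcache entry; this sketch just chains malloc'd retired descriptors off a
simplified entry, and all names (DemoRelEntry, demo_release, ...) are
hypothetical:

    #include <stdlib.h>

    typedef struct OldDesc          /* one retired partition descriptor */
    {
        struct OldDesc *next;       /* older ones are chained underneath */
        void           *desc;       /* assumed malloc'd in this mock-up */
    } OldDesc;

    typedef struct DemoRelEntry
    {
        int      refcnt;
        void    *partdesc;          /* current descriptor (think rd_partdesc) */
        OldDesc *oldpdcxt;          /* stand-in for the proposed rd_pdoldcxt */
    } DemoRelEntry;

    /*
     * An invalidation arrives while the entry is pinned (refcnt > 0): build
     * the new descriptor but keep the old one reachable, since a caller may
     * still hold a pointer into it.
     */
    void
    demo_rebuild_partdesc(DemoRelEntry *rel, void *newdesc)
    {
        OldDesc *old = malloc(sizeof(OldDesc));

        old->desc = rel->partdesc;
        old->next = rel->oldpdcxt;  /* "parent the new one under it" */
        rel->oldpdcxt = old;
        rel->partdesc = newdesc;
    }

    /*
     * Counterpart of RelationDecrementReferenceCount: when the count hits
     * zero, nothing can still point at the retired descriptors, so destroy
     * the whole chain right away rather than waiting for an invalidation.
     */
    void
    demo_release(DemoRelEntry *rel)
    {
        if (--rel->refcnt > 0)
            return;
        while (rel->oldpdcxt != NULL)
        {
            OldDesc *o = rel->oldpdcxt;

            rel->oldpdcxt = o->next;
            free(o->desc);
            free(o);
        }
    }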

On Tue, Jun 11, 2019 at 01:57:16PM -0400, Tom Lane wrote in <18286(dot)1560275836(at)sss(dot)pgh(dot)pa(dot)us>:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
> >> which seems to fix this issue for me.
>
> > Yeah, that looks right. I think my patch was full of fuzzy thinking
> > and inadequate testing; thanks for checking it over and coming up with
> > the right solution.
>
> > Anyone else want to look/comment?
>
> I think the existing code is horribly ugly and this is even worse.
> It adds cycles to RelationDecrementReferenceCount which is a hotspot
> that has no business dealing with this; the invariants are unclear;
> and there's no strong reason to think there aren't still cases where
> we accumulate lots of copies of old partition descriptors during a
> sequence of operations. Basically you're just doubling down on a
> wrong design.
>
> As I said upthread, my current inclination is to do nothing in this
> area for v12 and then try to replace the whole thing with proper
> reference counting in v13. I think the cases where we have a major
> leak are corner-case-ish enough that we can leave it as-is for one
> release.
>
> regards, tom lane
>
>

On Wed, Jun 12, 2019 at 03:11:56PM -0400, Tom Lane wrote in <3800(dot)1560366716(at)sss(dot)pgh(dot)pa(dot)us>:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > I think the change is responsive to your previous complaint that the
> > timing of stuff getting freed is not very well pinned down. With this
> > change, it's much more tightly pinned down: it happens when the
> > refcount goes to 0. That is definitely not perfect, but I think that
> > it is a lot easier to come up with scenarios where the leak
> accumulates because no cache flush happens while the refcount is 0
> > than it is to come up with scenarios where the refcount never reaches
> > 0. I agree that the latter type of scenario probably exists, but I
> > don't think we've come up with one yet.
>
> I don't know why you think that's improbable, given that the changes
> around PartitionDirectory-s cause relcache entries to be held open much
> longer than before (something I've also objected to on this thread).
>
> >> As I said upthread, my current inclination is to do nothing in this
> >> area for v12 and then try to replace the whole thing with proper
> >> reference counting in v13. I think the cases where we have a major
> >> leak are corner-case-ish enough that we can leave it as-is for one
> >> release.
>
> > Is this something you're planning to work on yourself?
>
> Well, I'd rather farm it out to somebody else, but ...
>
> > Do you have a
> > design in mind? Is the idea to reference-count the PartitionDesc?
>
> What we discussed upthread was refcounting each of the various
> large sub-objects of relcache entries, not just the partdesc.
> I think if we're going to go this way we should bite the bullet and fix
> them all. I really want to burn down RememberToFreeTupleDescAtEOX() in
> particular ... it seems likely to me that that's also a source of
> unpleasant memory leaks.
>
> regards, tom lane
>
>
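
Finally, since the direction being floated for v13 is a reference-counted
cache sub-structure, here's one more standalone sketch of what such a
mechanism might look like. This is only my guess at the shape of the idea
(get pins, release unpins, invalidation swaps in a new version); it is not a
design anyone on the thread has signed off on, and the names are made up:

    #include <stdlib.h>

    typedef struct RefCountedDesc
    {
        int  refcount;              /* includes the cache's own reference */
        /* ... the large sub-structure (partdesc, tupdesc, ...) lives here */
    } RefCountedDesc;

    typedef struct DemoCacheEntry
    {
        RefCountedDesc *current;    /* latest version built by the cache */
    } DemoCacheEntry;

    /* Callers pin the sub-structure instead of copying it out. */
    RefCountedDesc *
    demo_get_desc(DemoCacheEntry *ent)
    {
        ent->current->refcount++;
        return ent->current;
    }

    /* Last unpin of a superseded version frees it, whenever that happens. */
    void
    demo_release_desc(RefCountedDesc *d)
    {
        if (--d->refcount == 0)
            free(d);
    }

    /*
     * Invalidation: install a new version.  The cache drops its own
     * reference to the old one, so the old version lives on only as long as
     * some caller still holds a pin on it.
     */
    void
    demo_invalidate(DemoCacheEntry *ent, RefCountedDesc *newdesc)
    {
        newdesc->refcount = 1;          /* the cache's reference */
        demo_release_desc(ent->current);
        ent->current = newdesc;
    }

Something along these lines, shared across the partition descriptor, tuple
descriptor, and the other large relcache fields, is presumably what "bite the
bullet and fix them all" would amount to.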
