Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Date: 2018-07-16 20:36:51
Message-ID: 4993.1531773411@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> So I said I didn't want to do extra work on this, but I am looking into
> fixing it by having these aux process types run a ResourceOwner that can
> be told to clean up any open buffer pins at exit. We could be sure the
> coverage is complete by dint of removing the special-case code in
> resowner.c that allows buffer pins to be taken with no active resowner.
> Then CheckForBufferLeaks can be left as-is, ie something we do only
> in assert builds.

That turned out to be a larger can of worms than I'd anticipated, as it
soon emerged that we'd acquired a whole lot of cargo-cult programming
around ResourceOwners. Having a ResourceOwner in itself does nothing
for you: there has to be code someplace that ensures we'll call
ResourceOwnerRelease at an appropriate time. There was basically noplace
outside xact.c and portalmem.c that got this completely right. bgwriter.c
and a couple of other places at least tried a little, by doing a
ResourceOwnerRelease in their sigsetjmp-catching stanzas, but that didn't
account for the process-exit code path. Other places just created a
ResourceOwner and did *nothing* about cleaning it up.

I decided that the most expedient way of dealing with this was to create
a self-contained facility in resowner.c that would create a standalone
ResourceOwner and register a shmem-exit callback to clean it up.
That way it's easier (less code) to do it right than to do it wrong.
That led to the attached, which passes check-world as well as Michael's
full-disk test script.

I'm mostly pretty happy with this, but I think there are a couple of
loose ends in logicalfuncs.c and slotfuncs.c: those are creating
non-standalone ResourceOwners (children of whatever the active
ResourceOwner is) and doing nothing much to clean them up. That seems
pretty bogus. It's not a permanent resource leak, because sooner or
later the parent ResourceOwner will get cleaned up and that will
recurse to the child ... but then why bother with a child ResourceOwner
at all? I added asserts to these calls to verify that there was a
parent resowner (if there isn't, the code is just broken), but I think
that we should either add more code to clean up the child resowner
promptly, or just not bother with a child at all.

Not very sure what to do with this. I don't think we should try to
backpatch it, because it's essentially changing the API requirements
around buffer access in auxiliary processes --- but it might not be
too late to squeeze it into v11. It is a bug fix, in that the current
code can leak buffer pins in some code paths and we won't notice in
non-assert builds; but we've not really seen any field reports of that
happening, so I'm not sure how important this is.

regards, tom lane

Attachment Content-Type Size
use-resowner-in-all-aux-processes-1.patch text/x-diff 16.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-07-16 20:55:37 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Adrien Nayrat 2018-07-16 20:19:11 Re: New GUC to sample log queries