Re: Server crash (FailedAssertion) due to catcache refcount mis-handling

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Date: 2017-08-08 17:20:15
Message-ID: CA+TgmoYGz7ugdu0=Wf0WJ4fthBLrrWCUfAjGJZoPft=CCeS7rQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yeah, I thought about weakening the assertions too, but I couldn't
> see a fix of that kind that didn't seem mighty ad-hoc.

I don't really see what's ad-hoc about skipping it in the case where a
FATAL is in progress. I mean, skipping a sanity check only in the
cases where we know it might pass - and are OK with the fact that it
might not pass - seems to me to be an extremely difficult policy to
argue against on rational grounds. That's just the definition of
writing correct sanity checks.

More concretely, the present example seems no different than the
ResourceOwner stuff emitting warnings on commit and remaining silent
on abort. We could make it complain on both commit and abort, but
then it would fail spuriously because there's no other mechanism to
release resources in the abort path, so we don't. Similarly here, we
have every reason to expect the check to pass during ERROR recovery
but there is no reason to expect it to pass during FATAL recovery, yet
as coded we will do the test anyway. That's wrong.

> Now, there is some room to argue that AtEOXact_CatCache() is just
> obsolete and we should remove it altogether; I don't think it's
> caught a real bug since we invented resowners in 8.1.

Yeah, the same thought crossed my mind. IIUC, we'd crash if a
catcache reference were acquired without CurrentResourceOwner being
valid, so this is really just a belt-and-suspenders check.

> But, again, the larger question to my mind is where else we may have
> similar issues. There's certainly never been any methodical attempt
> to see whether elog(FATAL) will work everywhere.

IIRC, you raised a similar objection back when Bruce added
pg_terminate_backend(), which caused that change to be briefly
reverted before ultimately being put back. Despite the hardening you
did back then, I think it's highly likely that bugs remain in that
path, and I am of course not opposed to trying to improve the
situation. That having been said, the bugs that remain are probably
mostly quite low-probability, because otherwise we'd have found them
by now. I think parallel query is likely to flush out a few of the
ones that remain by creating a new class of backends that terminate
after a single query and may get killed by the leader at any time;
that's how we discovered this issue. Fuzz testing could be done, too,
e.g. run something like sqlsmith simultaneously in many sessions while
killing off backends at random. I'm also not deadly opposed to
redesigning the whole mechanism either, but I think that should be
approached with a measure of caution: it might end up reducing
reliability rather than increasing it. I suggest in any case that we
start with a surgical fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-08-08 18:11:31 Re: More race conditions in logical replication
Previous Message Tom Lane 2017-08-08 16:26:36 Re: Server crash (FailedAssertion) due to catcache refcount mis-handling