Re: Server crash (FailedAssertion) due to catcache refcount mis-handling

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Date: 2017-08-08 20:34:51
Message-ID: 16208.1502224491@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah, I thought about weakening the assertions too, but I couldn't
>> see a fix of that kind that didn't seem mighty ad-hoc.

> More concretely, the present example seems no different than the
> ResourceOwner stuff emitting warnings on commit and remaining silent
> on abort. We could make it complain on both commit and abort, but
> then it would fail spuriously because there's no other mechanism to
> release resources in the abort path, so we don't. Similarly here, we
> have every reason to expect the check to pass during ERROR recovery
> but there is no reason to expect it to pass during FATAL recovery, yet
> as coded we will do the test anyway. That's wrong.

I think that's arguing from expedience not principle. We had every
reason to think it would pass during FATAL errors too, until we noticed
this specific misbehavior; and there is at least as much of an argument
that this is a bug in SearchCatCacheList as there is that the check
is too strong.

>> Now, there is some room to argue that AtEOXact_CatCache() is just
>> obsolete and we should remove it altogether; I don't think it's
>> caught a real bug since we invented resowners in 8.1.

> Yeah, the same thought crossed my mind. IIUC, we'd crash if a
> catcache reference were acquired without CurrentResourceOwner being
> valid, so this is really just a belt-and-suspenders check.

Right. It was worth keeping it around till we were sure all the bugs
were shaken out of ResourceOwners, but surely we crossed the point of
diminishing returns for that long ago.

> ... I'm also not deadly opposed to
> redesigning the whole mechanism either, but I think that should be
> approached with a measure of caution: it might end up reducing
> reliability rather than increasing it. I suggest in any case that we
> start with a surgical fix.

In the absence of clear evidence that there are similar bugs elsewhere,
I agree that redesigning FATAL exits would likely cause more problems
than it solves. But I feel like it would be a good thing to test for.
I wonder if Andreas would be interested in trying the randomly-timed-
SIGTERM thing with sqlsmith.

In the meantime, I think my vote would be to remove AtEOXact_CatCache.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-08-08 21:19:30 Re: pgsql: Fix inadequacies in recently added wait events
Previous Message Andres Freund 2017-08-08 20:22:33 Infrastructure for JIT compiling tuple deforming