Re: Server crash (FailedAssertion) due to catcache refcount mis-handling

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Date: 2017-08-08 15:54:26
Message-ID: CA+Tgmob2LxK1vdGmt_nA_B4DW7PzcZoa+Ece4_VihshUvuyoog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 8, 2017 at 11:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> By looking at the stack-trace, and as discussed it with my team members;
>> what we have observed that in SearchCatCacheList(), we are incrementing
>> refcount and then decrementing it at the end. However for some reason, if
>> we are in TRY() block (where we increment the refcount), and hit with any
>> interrupt, we failed to decrement the refcount due to which later we get
>> assertion failure.
>
> Hm. So SearchCatCacheList has a PG_TRY block that is meant to release
> those refcounts, but if you hit the backend with a SIGTERM while it's
> in that function, control goes out through elog(FATAL) which doesn't
> execute the PG_CATCH cleanup. But it does do AbortTransaction which
> calls AtEOXact_CatCache, and that is expecting that all the cache
> refcounts have reached zero.
>
> We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
> of plain PG_TRY. But I have an itchy feeling that there may be a lot
> of places with similar issues. Should we be revisiting the basic way
> that elog(FATAL) works, to make it less unlike elog(ERROR)?

I'm not sure what the technically best fix here is. It strikes me
that the charter of FATAL ought to be to exit the backend cleanly,
safely, and expeditiously. PG_ENSURE_ERROR_CLEANUP, or some similar
mechanism, is necessary when we need to correct the contents of shared
memory so that other backends can continue to function, but there's no
such problem here. You're proposing to do more work in the FATAL
pathway so that the debugging cross-checks in the FATAL pathway will
pass, which is not an entirely palatable idea. It's definitely
frustrating that the ERROR and FATAL pathways are so different - that
generated a surprising amount of the work around DSM - but I'm still
skeptical about a solution that involves doing more cleanup of what
are essentially irrelevant backend-local data structures in the FATAL
path. Arguably, this assertion is merely overzealous; we could skip
this processing when proc_exit_inprogress, for example.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sandeep Thakkar 2017-08-08 16:15:35 Re: pl/perl extension fails on Windows
Previous Message amul sul 2017-08-08 15:45:38 Re: reload-through-the-top-parent switch the partition table