Re: Server crash (FailedAssertion) due to catcache refcount mis-handling

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Date: 2017-08-08 15:36:17
Message-ID: 4244.1502206577@sss.pgh.pa.us
Lists: pgsql-hackers

Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com> writes:
> We have observed a random server crash (FailedAssertion) while running a few
> tests at our end. Stack-trace is attached.

> Looking at the stack-trace, and after discussing it with my team members,
> what we have observed is that in SearchCatCacheList() we increment the
> refcount and then decrement it at the end. However, if we are in the
> TRY() block (where we increment the refcount) and get hit by an
> interrupt, we fail to decrement the refcount, due to which we later get
> an assertion failure.

Hm. So SearchCatCacheList has a PG_TRY block that is meant to release
those refcounts, but if you hit the backend with a SIGTERM while it's
in that function, control goes out through elog(FATAL) which doesn't
execute the PG_CATCH cleanup. But it does do AbortTransaction which
calls AtEOXact_CatCache, and that is expecting that all the cache
refcounts have reached zero.

We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
of plain PG_TRY. But I have an itchy feeling that there may be a lot
of places with similar issues. Should we be revisiting the basic way
that elog(FATAL) works, to make it less unlike elog(ERROR)?
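
To make the failure mode concrete, here is a toy model in plain C. It is
not the backend code: every toy_ name is invented for illustration, a bare
setjmp/longjmp pair stands in for PG_TRY/PG_CATCH, a single callback slot
stands in for the exit-time hook that PG_ENSURE_ERROR_CLEANUP arranges, and
the catch branch swallows the error where the real code would re-throw it.
The sketch only assumes what is described above: ERROR unwinds to the
handler, FATAL bypasses it, and the end-of-transaction check expects a zero
refcount.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; none of these are PostgreSQL definitions. */
typedef enum { TOY_ERROR, TOY_FATAL } ToyLevel;
typedef void (*ToyCleanup)(void);

static jmp_buf   *toy_try_frame = NULL;     /* innermost "try" frame */
static jmp_buf    toy_exit_point;           /* where a FATAL exit lands */
static ToyCleanup toy_exit_callback = NULL; /* models an exit-time cleanup hook */
static int        toy_refcount = 0;         /* models a cache refcount */
static bool       toy_leak_seen = false;    /* set if the check finds a leak */

static void toy_release(void) { toy_refcount--; }

/* Models the end-of-transaction check that wants refcounts back at zero. */
static void toy_end_of_xact_check(void)
{
    if (toy_refcount != 0)
        toy_leak_seen = true;
}

/* Models error reporting: ERROR unwinds to the try frame, FATAL skips it. */
static void toy_report(ToyLevel level)
{
    if (level == TOY_ERROR)
    {
        assert(toy_try_frame != NULL);
        longjmp(*toy_try_frame, 1);
    }
    /* FATAL: run the exit callback (if any), then the abort-time check. */
    if (toy_exit_callback != NULL)
        toy_exit_callback();
    toy_end_of_xact_check();
    longjmp(toy_exit_point, 1);
}

/* Takes a refcount, then gets "interrupted" inside the try block. */
static void toy_search_list(ToyLevel interrupt, bool ensure_cleanup)
{
    jmp_buf  frame;
    jmp_buf *outer = toy_try_frame;

    toy_refcount++;
    if (ensure_cleanup)
        toy_exit_callback = toy_release;    /* also clean up on FATAL exit */

    if (setjmp(frame) == 0)
    {
        toy_try_frame = &frame;
        toy_report(interrupt);              /* never returns */
    }
    else
    {
        /* "catch" branch: reached for ERROR only */
        toy_try_frame = outer;
        toy_exit_callback = NULL;
        toy_release();
    }
}

/* Runs one scenario; returns true if the final check saw a leaked refcount. */
bool toy_run(ToyLevel interrupt, bool ensure_cleanup)
{
    toy_try_frame = NULL;
    toy_exit_callback = NULL;
    toy_refcount = 0;
    toy_leak_seen = false;

    if (setjmp(toy_exit_point) == 0)
    {
        toy_search_list(interrupt, ensure_cleanup);
        toy_end_of_xact_check();            /* normal end of transaction */
    }
    return toy_leak_seen;
}
```

Calling toy_run(TOY_FATAL, false) reports a leak, which corresponds to the
assertion failure in the report; toy_run(TOY_ERROR, false) and
toy_run(TOY_FATAL, true) do not. That the fix amounts to registering the
cleanup in a second place is also what makes me suspect other plain PG_TRY
blocks have the same hole.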

regards, tom lane
