Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS
Date: 2014-12-15 15:49:29
Message-ID: 20141215154929.GI5023@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> contrib/test_decoding with
>
> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)
>
> for example here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
> and here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05

Yes, I've been trying to debug that. I've finally managed to reproduce
locally. Unfortunately it's not directly reproducible on my laptop...

A fair amount of tinkering later I've found out that it's not actually
CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup
interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens
to make parts of the code slow enough to not reach a ordinary
AcceptInvalidationMessages() in time.

The problem is that in the added regression test there can be a
situation where there's a relcache entry that's *not* currently visible
but still referenced. Consider:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE somechange(id serial primary key);
INSERT INTO changeresult
SELECT data FROM pg_logical_slot_get_changes(...);

While get_changes stores it data into a tuplestore there can be a moment
where 'changeresult' has a refcnt > 0 due to the INSERT, but is
invisible because we're using a historic snapshot from when changeresult
doesn't exist.

Without catchup invalidations and/or an outside reference to a relation
that's normally not a problem because it won't get reloaded from the
caches at an inappropriate time while invisible. Until a few weeks ago
there was no regression test covering that case which is why these
crashes are only there now.

It triggers via:
RelationCacheInvalidate() ->
RelationClearRelation(relation, true) ->
/* Build temporary entry, but don't link it into hashtable */
newrel = RelationBuildDesc(save_relid, false);
if (newrel == NULL)
{
/* Should only get here if relation was deleted */
RelationCacheDelete(relation);
RelationDestroyRelation(relation, false);
elog(ERROR, "relation %u deleted while still in use", save_relid);
}

That path is only hit while refcnt > 0

RelationDestroyRelation() has
Assert(RelationHasReferenceCountZero(relation));

So that path doesn't really work... Without assertions we'd "just" get a
transient ERROR. I think that path should generally loose the
RelationDestroyRelation() - if it's referenced that's surely not the
right thing to do. I'm not actually convinced logical decoding is the
only way that assert can be reached.

Since logical decoding happens inside a subtransaction (when called via
SQL) and invalidate caches one relatively straightforward way to fix
this would be to add something roughly akin to:
Assert(!relation->rd_isvalid)
/*
* This should only happen when we're doing logical decoding where
* it can happen when [above].
*/
if (HistoricSnapshotActive())
return;

There's no chance that such a entry could survive too long in an invalid
state (minus preexisting bugs independent of decoding) since we hit the
same paths that subtransaction abort hits.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-12-15 15:50:12 Re: Fractions in GUC variables
Previous Message Tom Lane 2014-12-15 15:45:55 Re: Join push-down support for foreign tables