Autovacuum breakage from a734fd5d1

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Autovacuum breakage from a734fd5d1
Date: 2016-11-27 18:45:47
Message-ID: 21299.1480272347@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Buildfarm member skink failed a couple days ago:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01

I believe the interesting parts of the log are

2016-11-25 18:29:03.285 UTC [583882e7.2a45:1] LOG: autovacuum: dropping orphan temp table "(null)"."(null)" in database "regression"
2016-11-25 18:29:03.919 UTC [583882e7.2a45:2] ERROR: could not open relation with OID 33895
2016-11-25 18:29:03.919 UTC [583882e7.2a45:3] CONTEXT: dropping of orphan temp table "(null)"."(null)" in database "regression"

TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "/home/andres/build/buildfarm/HEAD/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c", Line: 1001)

2016-11-25 18:29:07.524 UTC [58387ba7.609:3] LOG: server process (PID 10821) was terminated by signal 6: Aborted
2016-11-25 18:29:07.526 UTC [58387ba7.609:4] LOG: terminating any other active server processes

10821 = 2a45 hex so the first three log entries are from the process that
later asserted. The assertion is evidently from a "pfree(NULL)" call.

Looking into autovacuum.c, it's pretty clear that it was trying to drop
what it thought was an orphan temp table, only the temp table was already
gone, causing get_namespace_name() and get_rel_name() to return NULL.
It's only a platform-specific artifact that the ereport calls printed
"(null)" rather than crashing, and of course the pfree(relname) down at
the bottom of the loop is what caused the assert failure.

Presumably, what happened here is that somebody else already deleted
the "orphan" table before autovacuum could get to it. Since there's
no previous crash in this run, it's difficult to come to any conclusion
other than that the "orphan" got deleted by its owning backend, meaning
that the test for orphan status is jumping the gun and deciding that
a table is orphaned before its owning session is really gone. That's
rather disastrous if true.

The other problem here is that the code is supposing that it can access
a temp table's catalog entries without having obtained any lock on it.
Correct coding would involve acquiring the relation lock and then
checking to see if the table is still there.

I am also not feeling terribly comfortable with the potentially long
interval between where we decide that relation OID so-and-so is an
orphaned table and where we summarily remove that relation OID. This
is probably safe, if you assume that temp tables aren't likely to
have lifespans comparable to the OID counter's wraparound interval,
but I think we would be well advised to recheck that decision after
acquiring lock.

At least some of these behaviors were presumably hazards before, but were
effectively masked by the long delay before we'd attempt to remove an
orphan temp table. But I am wondering why we've not seen reports of
crashes due to the old "found orphan temp table" log message trying to
dereference null pointers, since that could have occurred immediately
after the table became (ostensibly) orphan.

I don't have proposed fixes for any of this yet, but it's clearly
an area that needs review.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2016-11-27 18:57:06 Re: Logical Replication WIP
Previous Message Tomas Vondra 2016-11-27 18:42:58 Re: PATCH: two slab-like memory allocators