Re: [BUGS] BUG #5412: test case produced, possible race condition.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Rusty Conover <rconover(at)infogears(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [BUGS] BUG #5412: test case produced, possible race condition.
Date: 2010-04-14 18:31:44
Message-ID: 8731.1271269904@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

I wrote:
> [ theory about cause of Rusty's crash ]

I started to doubt this theory after wondering why the problem hadn't
been exposed by CLOBBER_CACHE_ALWAYS testing, which is done routinely
by the buildfarm. That setting would surely cause the cache flush to
happen at the troublesome time. After a good deal more investigation,
I found out why it doesn't crash with that. The problematic case is
for a relation that has rd_newRelfilenodeSubid nonzero but
rd_createSubid zero (i.e., it's been truncated in the current xact).
Given that, RelationFlushRelation will attempt a rebuild but
RelationCacheInvalidate won't exempt the relation from destruction.
However, if you do a TRUNCATE under CLOBBER_CACHE_ALWAYS, the relcache
entry gets blown away immediately at the conclusion of that command,
because we'll do a RelationCacheInvalidate as a consequence of
CLOBBER_CACHE_ALWAYS. When the relcache entry is rebuilt for later use,
it won't have rd_newRelfilenodeSubid set, so it's not a hazard anymore.
In order to expose this bug, the relcache entry has to survive past the
TRUNCATE, and then a cache flush has to occur while we are in the process
of rebuilding it, not before.
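
The hazardous state can be written down as a small toy model. This is only
a sketch of the condition described above, not the actual relcache.c code:
the class and function names are hypothetical, and the generalization that
either nonzero subid triggers an in-place rebuild, while only a nonzero
rd_createSubid earns an exemption, is an assumption drawn from the
description in this message.

```python
from dataclasses import dataclass

INVALID_SUBID = 0  # stands in for "no subtransaction recorded"


@dataclass
class RelcacheEntry:
    """Hypothetical stand-in for the two relcache fields discussed here."""
    rd_createSubid: int = INVALID_SUBID
    rd_newRelfilenodeSubid: int = INVALID_SUBID


def flush_rebuilds_in_place(rel: RelcacheEntry) -> bool:
    # Assumed model of the RelationFlushRelation decision: the entry is
    # rebuilt rather than dropped if it was created or given a new
    # relfilenode in the current transaction.
    return (rel.rd_createSubid != INVALID_SUBID
            or rel.rd_newRelfilenodeSubid != INVALID_SUBID)


def invalidate_exempts(rel: RelcacheEntry) -> bool:
    # Assumed model of the RelationCacheInvalidate decision: only entries
    # created in the current transaction are spared.
    return rel.rd_createSubid != INVALID_SUBID


def is_hazard(rel: RelcacheEntry) -> bool:
    # The problematic case: a rebuild is attempted, but a cache flush
    # arriving during that rebuild would not spare the entry.
    return flush_rebuilds_in_place(rel) and not invalidate_exempts(rel)
```

In this model, is_hazard(RelcacheEntry(rd_newRelfilenodeSubid=2)) is true
(the truncated-in-this-xact case), while an entry rebuilt from scratch after
a CLOBBER_CACHE_ALWAYS flush has both fields zero and is not a hazard.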

What this suggests is that CLOBBER_CACHE_ALWAYS is actually too strong
to provide a thorough test of cache flush hazards. Maybe we need an
alternate setting along the lines of CLOBBER_CACHE_SOMETIMES that would
randomly choose whether or not to flush at any given opportunity. But
if such a setup did produce a crash, it'd be awfully hard to reproduce
for investigation. Ideas?
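
One way to think about the reproducibility concern is to drive the random
choice from a recorded seed. The sketch below is purely illustrative and is
not something proposed in this thread: the class name, the seed handling,
and the probability parameter are all hypothetical. It would also only
replay a failure if the sequence of flush opportunities is itself the same
from run to run, which is an assumption that may not hold in a real backend.

```python
import random


class SometimesClobber:
    """Hypothetical model of a CLOBBER_CACHE_SOMETIMES decision source."""

    def __init__(self, seed: int, probability: float = 0.5):
        self.seed = seed                  # log this so a failing run can be replayed
        self.probability = probability    # chance of flushing at each opportunity
        self._rng = random.Random(seed)   # private generator, independent of global state
        self.decisions = []               # record of every choice made so far

    def should_flush(self) -> bool:
        # Decide at one flush opportunity whether to clobber the cache.
        flush = self._rng.random() < self.probability
        self.decisions.append(flush)
        return flush


def replay(seed: int, opportunities: int, probability: float = 0.5):
    # Re-create the exact flush/no-flush pattern produced by a given seed.
    clobber = SometimesClobber(seed, probability)
    return [clobber.should_flush() for _ in range(opportunities)]
```

With the seed logged alongside a crash report, replay(seed, n) yields the
same pattern of flushes again, so the failing interleaving could in
principle be retried rather than hunted for at random.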

There is another slightly odd thing here, which is that the stack trace
Rusty provided clearly shows the crash occurring during processing of a
local relcache invalidation message for the truncated relation. This
would be expected during execution of the TRUNCATE itself, but at that
point the rel has positive refcnt so there's no problem. According to
the stack trace the active SQL command is an INSERT ... SELECT, and I
wouldn't expect that to queue any relcache invals. Are there any
triggers or other unusual things in the real application (not the
watered-down test case) that would be triggered in INSERT/SELECT?

regards, tom lane
