Re: Postgresql 8.4.1 segfault, backtrace

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Brown <mbrown(at)fensystems(dot)co(dot)uk>
Cc: Richard Neill <rn214(at)hermes(dot)cam(dot)ac(dot)uk>, pgsql-bugs(at)postgreSQL(dot)org
Subject: Re: Postgresql 8.4.1 segfault, backtrace
Date: 2009-09-24 23:00:54
Message-ID: 730.1253833254@sss.pgh.pa.us
Lists: pgsql-bugs

Michael Brown <mbrown(at)fensystems(dot)co(dot)uk> writes:
>> ... (If you have a spare machine with the same OS and
>> the same postgres executables, maybe you could put the core file on that
>> and let me ssh in to have a look?)

[ ssh details ]

Thanks for letting me poke around. What I found out is that the
hash_seq_search loop in RelationCacheInitializePhase2 is crashing
because it's attempting to examine a hashtable entry that is on the
hashtable's freelist!? Given that information I think the cause of
the bug is fairly clear:

1. RelationCacheInitializePhase2 loads the rules or trigger descriptions
for some system catalog (actually it must be the latter; we haven't got
any catalogs with rules attached).

2. By chance, a shared-cache-inval flush comes through while it's doing
that, causing all non-open, non-nailed relcache entries to be discarded.
Including, in particular, the one that is "next" according to the
hash_seq_search's status.

3. Now the loop iterates into the freelist, and kaboom. It will
probably fail to fail on entries that are actually discarded, because
they still have valid pointers in them ... but as soon as it gets to
a never-yet-used freelist entry, it'll do a null dereference.
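
To make the three steps concrete, here is a minimal sketch in C of a scan that
remembers its "next" entry before a discard happens. It is a toy model, not
PostgreSQL's dynahash code: the Entry and Scan structs, scan_next(), discard()
and scan_walks_into_freelist() are all hypothetical names, and the assumption
that a discarded entry's link field is reused as its freelist link is part of
the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy entry; not PostgreSQL's dynahash layout. */
typedef struct Entry {
    struct Entry *next;  /* chain link while live, freelist link once discarded */
    bool live;
} Entry;

/* Hypothetical scan status: only the remembered "next" entry. */
typedef struct Scan {
    Entry *cur_entry;    /* entry the scan will return next */
} Scan;

/* Return the next entry and remember its successor for the following call. */
static Entry *scan_next(Scan *s)
{
    Entry *e = s->cur_entry;
    if (e != NULL)
        s->cur_entry = e->next;
    return e;
}

/* Discard the entry at *chain_link: unlink it and push it on the freelist. */
static void discard(Entry **chain_link, Entry **freelist)
{
    Entry *e = *chain_link;
    *chain_link = e->next;
    e->next = *freelist;
    e->live = false;
    *freelist = e;
}

/* Returns true if the scan hands back entries that are no longer live. */
static bool scan_walks_into_freelist(void)
{
    Entry never_used = { NULL, false };  /* freelist entry that was never used */
    Entry b = { NULL, true };
    Entry a = { &b, true };              /* bucket chain: a -> b */
    Entry *chain = &a;
    Entry *freelist = &never_used;
    Scan scan = { chain };

    Entry *current = scan_next(&scan);   /* returns a, remembers b */
    assert(current == &a);

    discard(&a.next, &freelist);         /* flush discards b while a is in use */

    Entry *stale = scan_next(&scan);     /* returns b, which is on the freelist */
    Entry *worse = scan_next(&scan);     /* follows b's freelist link */
    return stale == &b && !stale->live && worse == &never_used;
}
```

In this model the scan first returns the discarded entry b, whose fields are
still intact, and only then reaches the never-used entry, which is where a
real scan would be expected to hit the null dereference.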

RelationCacheInitializePhase2 is breaking the rules by assuming that it
can continue to iterate the hash_seq_search after doing something that
might cause a hash entry other than the current one to be discarded.
We can probably fix that without too much trouble, e.g. by restarting the
loop after an update.
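
The restart idea can be sketched as follows. This is only an illustration
under stated assumptions, not the eventual PostgreSQL fix: Table, Entry,
table_init(), flush(), load_entry(), load_all() and count_free() are all
hypothetical names, "nailed" entries are assumed to survive a flush, and the
flush is simulated by a counter rather than by real invalidation messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS 4
#define NENTRIES 8

/* Toy cache entry; every name here is hypothetical, not PostgreSQL's. */
typedef struct Entry {
    struct Entry *next;   /* next entry in the bucket chain or on the freelist */
    bool nailed;          /* nailed entries survive a flush */
    bool needs_load;      /* still needs its "load" step */
} Entry;

typedef struct Table {
    Entry *buckets[NBUCKETS];
    Entry *freelist;
    Entry  pool[NENTRIES];
    int    flushes_pending;   /* number of loads that trigger a flush */
} Table;

/* Build chains of the form nailed -> non-nailed in every bucket. */
static void table_init(Table *t, int flushes)
{
    for (int b = 0; b < NBUCKETS; b++)
        t->buckets[b] = NULL;
    t->freelist = NULL;
    t->flushes_pending = flushes;
    for (int i = NENTRIES - 1; i >= 0; i--) {
        Entry *e = &t->pool[i];
        e->nailed = (i < NBUCKETS);
        e->needs_load = e->nailed;
        e->next = t->buckets[i % NBUCKETS];
        t->buckets[i % NBUCKETS] = e;
    }
}

/* Discard every non-nailed entry by moving it to the freelist. */
static void flush(Table *t)
{
    for (int b = 0; b < NBUCKETS; b++) {
        Entry **link = &t->buckets[b];
        while (*link != NULL) {
            Entry *e = *link;
            if (e->nailed) {
                link = &e->next;
            } else {
                *link = e->next;
                e->next = t->freelist;
                t->freelist = e;
            }
        }
    }
}

/* The "load" step; it may be interrupted by a flush of other entries. */
static void load_entry(Table *t, Entry *e)
{
    if (t->flushes_pending > 0) {
        t->flushes_pending--;
        flush(t);
    }
    e->needs_load = false;
}

/* Load every entry that needs it, restarting the scan after each load. */
static int load_all(Table *t)
{
    int loads = 0;
    bool restart;
    do {
        restart = false;
        for (int b = 0; b < NBUCKETS && !restart; b++) {
            for (Entry *e = t->buckets[b]; e != NULL; e = e->next) {
                if (e->needs_load) {
                    load_entry(t, e);
                    loads++;
                    restart = true;   /* chain may have changed: start over */
                    break;            /* leave before touching e->next */
                }
            }
        }
    } while (restart);
    return loads;
}

/* Count entries on the freelist. */
static int count_free(const Table *t)
{
    int n = 0;
    for (const Entry *e = t->freelist; e != NULL; e = e->next)
        n++;
    return n;
}
```

The point of the sketch is that no chain pointer is carried across a call
that might discard entries. Each restart rescans from the first bucket, which
costs extra passes but stays simple, and the loop ends once a full pass finds
nothing left to load.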

But: the question at this point is why we've never seen such a report
before 8.4. If this theory is correct, it's been broken for a *long*
time. I can think of a couple of possible explanations:

A: the problem can only manifest if this loop has work to do for
a relcache entry that is not the last one in its bucket chain.
8.4 might have added more preloaded relcache entries than were there
before. Or the 8.4 changes in the hash functions might have shuffled
the entries' bucket placement around so that the problem can happen
when it couldn't before.

B: the 8.4 changes in the shared-cache-inval mechanism might have
made it more likely that a freshly started backend could get hit with a
relcache flush request. I should think that those changes would have
made this *less* likely, not more so, so maybe there is an additional
bug lurking in that area.

I shall go and do some further investigation, but at least it's now
clear where to look. Thanks for the report, and for being so helpful
in providing information!

regards, tom lane
