Re: [HACKERS] Sure enough, SI buffer overrun is broken

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Sure enough, SI buffer overrun is broken
Date: 2000-01-29 19:53:19
Message-ID: 29928.949175599@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
>> I built the current sources with MAXNUMMESSAGES set to 32 in
>> src/include/storage/sinvaladt.h. The regular regress tests
>> run OK, with just a few NOTICEs about 'cache state reset'
>> and 'SI buffer overflow' inserted in the normal outputs
>> (as you'd expect, if SI overrun occurs).
>>
>> However, the parallel tests crash spectacularly, with weird errors
>> and Assert() coredumps.

> Is the call RelationCacheInvalidate(false not true) in ResetSys-
> temCaches() right ? Relation descriptors would be destoryed if
> ResetSystemCaches() occurs in CommandConterIncrement().

You are absolutely right. I have thought before that it was extremely
dangerous for RelationCacheInvalidate *ever* to blow away a relcache
entry with a positive refcount. I have changed ResetSystemCaches to
pass TRUE, and have modified the comments for RelationCacheInvalidate
to indicate that this is probably the only safe setting. I also changed
RelationIdInvalidateRelationCacheByRelationId to unconditionally pass
true to RelationFlushRelation. Now, RelationFlushRelation is *never*
called with onlyFlushReferenceCountZero=false, so it will always attempt
to rebuild a relcache entry that has positive refcount.

I suspect that the "feature" of RelationFlushRelation to allow blowing
away a relcache entry regardless of refcount was a hack put in back when
relcache refcounts couldn't be trusted (because elog(ERROR) would leave
refcounts positive). Now we have RelationCacheAbort to fix refcounts
after elog, so I see no reason to take the risk of trying to destroy an
open relcache entry.

With these two changes in place, the parallel regress tests seem much
more stable. There is still a big problem though, which is that the
"portals" regress test sometimes fails. I traced this far enough to
discover that the code is trying to use a TupleDesc that it's stored in
the ScanTupleSlot of a plan, and this TupleDesc is no longer valid ---
presumably the relcache entry it was gotten from has been flushed and
rebuilt, leaving the plan with a dangling TupleDesc pointer. Ugh.

I do not think it is very practical to try to change all of the places
that assume that they can save pointers to the TupleDesc of a relcache
entry. Instead I am thinking about solving the problem inside the
relcache, as follows:

During a relcache entry rebuild, do not simply free and reconstruct
the TupleDesc. Instead, read the catalogs to build a new TupleDesc,
and compare this one to the old one. If they are the same, free the
new one instead (leaving the old one in place, and hence stored pointers
to it are still valid). If they are not the same, then elog(ERROR).

(elog may sound overly paranoid, but this condition indicates that the
table's definition has actually changed since we grabbed the refcount on
it --- remember we wouldn't be doing this at all if the relcache entry had
zero refcount --- and therefore all kinds of derived information such as
plans may be wrong. Pressing ahead will probably lead to crash.)

We may need to do the same for any other substructures of the relcache
entry that are visible from outside relcache.c.

I know this sounds pretty grotty, but we are already doing it for the
relcache entry itself --- rebuild re-uses the same physical entry
rather than deleting and reallocating it, to ensure that pointers
to an open Relation stay valid over an SI flush. We need to extend
the same guarantee to the substructure of the relcache entry.

Unless someone has a better idea, I'll work on that.

>> Some of the unexpected messages in the
>> postmaster log are:
>>
>> NOTICE: LockRelease: locktable lookup failed, no lock

> I have seen this NOTICE only once or twice.
> This seems because of setting MyProc->xid to InvalidTransa
> ctionId in CommitTransaction() and AbortTransaction().
> There's a little time until AtCommit(Abort)_Locks.
> I have no idea to solve this now.

I am not seeing it after the change to never flush relcaches with
positive refcount. I think the locks being complained of are
probably the locks that should have been kept on flushed relations,
and that it's not CommitTransaction's fault.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2000-01-29 20:11:02 Re: [HACKERS] Copyright
Previous Message Jeff MacDonald <jeff@pgsql.com> 2000-01-29 19:15:34 Re: [HACKERS] Postgres LOGO(was: Beta is February 15th)