Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Date: 2016-05-11 00:58:33
Message-ID: 20160511005832.ilpxhld3o6yvsryb@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2016-05-10 17:36:06 -0700, Jeff Janes wrote:
> OK, got it. I don' t know how I missed the bigger picture of that
> function in the first place.
>
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint? Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

I don't think that really makes a large difference. Unless we'd hold
OidGenLock across the XLogInsert() - which we don't want to do for
performance / deadlock reasons - we'd have a similar race. I think
it's simply important to recognize (and better document!) that the
values in the checkpoint record are from roughly the time of the REDO
pointer and not from the end of the checkpoint.

> >> I don't understand how it could be out-of-date at that point. But
> >> obviously it is.
> >
> > A checkpoint logically "starts" at 1) in the above abbreviated
> > CreateCheckPoint(), that's where recovery starts when starting up from
> > that checkpoint. But inbetween 1) and 4) all other backends can continue
> > to insert WAL, and it'll be replayed *before* the checkpoint's record
> > itself. That means that if some backend generates a NEXTOID record
> > between 2) and 4) (with largers checkpoints we're looking at minutes to
> > an hour of time), it's effects will temporarily take effect (as in
> > ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> > replay will overwrite it unconditionally:
> > void
> > xlog_redo(XLogReaderState *record)
> > {
> > else if (info == XLOG_CHECKPOINT_ONLINE)
> > {
> > ...
> > /* ... but still treat OID counter as exact */
> > LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
> > ShmemVariableCache->nextOid = checkPoint.nextOid;
> > ShmemVariableCache->oidCount = 0;
> > LWLockRelease(OidGenLock);
> >
> > Makes sense?
>
> So it seems like we should unconditionally **not** do that when
> replaying XLOG_CHECKPOINT_ONLINE, right?

I think we should actually just not do it during replay *at all*. It
seems entirely sufficient to use ->nextOid in StartupXLOG when it has
found the checkpoint to start up from:
/* initialize shared memory variables from the checkpoint record */
ShmemVariableCache->nextXid = checkPoint.nextXid;
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;

FWIW, I've commented out the relevant sections from xlog_redo and since
then I've not been able to reproduce the issue.

> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID. The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

That's my opinion at least.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2016-05-11 01:01:12 Re: ALTER TABLE lock downgrades have broken pg_upgrade
Previous Message Fabrízio de Royes Mello 2016-05-11 00:56:24 Re: Does Type Have = Operator?