Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Date: 2016-05-10 12:09:02
Message-ID: CA+TgmoZ0PzoGMRBU-NOEx8YLEf5LeBF8TXiht5r0zeEVNpeT7g@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> The easy way to trigger this problem would be to have an oid wraparound
> - but the WAL shows that that's not the case here. I've not figured
> that one out entirely (and won't tonight). But I do see WAL records
> like:
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> i.e. two NEXTOID records allocating the same range, which obviously
> doesn't seem right. There's also every now and then close by ranges:
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> rmgr: XLOG len (rec/tot): 4/ 30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
>
> As far as I can see something like the above, or an oid wraparound, are
> pretty much deadly for toast.
>
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

I assume that this was installed as a performance optimization, and I
don't really see why it shouldn't be safe, or couldn't be made safe. I
assume that the wraparound case was deemed safe because at that time
the idea of 4 billion OIDs getting used with old transactions still
active seemed inconceivable. It seems to me that the real question
here is how you're getting two calls to XLogPutNextOid() with the same
value of ShmemVariableCache->nextOid, and the answer, as it seems to
me, must be that LWLocks are broken. Either two processes are
managing to hold OidGenLock in exclusive mode at the same time, or
they're acquiring it in quick succession but without the second
process seeing all of the updates performed by the first process.
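
To make the second possibility concrete, here is a minimal single-process
sketch in C of how a stale view of the shared counters would produce two
identical NEXTOID records. It is a toy model, not the PostgreSQL source:
OidState, log_nextoid(), get_new_oid() and the two scenario functions are
hypothetical names, the prefetch size of 8192 is assumed to match
VAR_OID_PREFETCH, and the starting OID is picked only so that the logged
value lines up with the record quoted above. The lock itself is not
modelled; the "stale" copy stands in for a process that got the lock but
did not see the previous holder's writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VAR_OID_PREFETCH 8192   /* assumed to match PostgreSQL's constant */
#define MAX_LOG 64

typedef uint32_t Oid;

/* Hypothetical stand-in for the shared nextOid / oidCount pair. */
typedef struct
{
    Oid      nextOid;
    uint32_t oidCount;
} OidState;

/* Toy "WAL": just remembers the value of each NEXTOID record. */
static Oid    nextoid_log[MAX_LOG];
static size_t nextoid_log_len = 0;

static void
log_nextoid(Oid value)
{
    assert(nextoid_log_len < MAX_LOG);
    nextoid_log[nextoid_log_len++] = value;
}

/* Simplified allocator: log a new range only when the prefetched one is used up. */
static Oid
get_new_oid(OidState *st)
{
    Oid result;

    if (st->oidCount == 0)
    {
        log_nextoid(st->nextOid + VAR_OID_PREFETCH);
        st->oidCount = VAR_OID_PREFETCH;
    }
    result = st->nextOid;
    st->nextOid++;
    st->oidCount--;
    return result;
}

/* Count pairs of log entries that carry the same NEXTOID value. */
static size_t
count_duplicate_records(void)
{
    size_t dups = 0;

    for (size_t i = 0; i < nextoid_log_len; i++)
        for (size_t j = i + 1; j < nextoid_log_len; j++)
            if (nextoid_log[i] == nextoid_log[j])
                dups++;
    return dups;
}

/* Every caller sees the previous caller's updates: three distinct records. */
static size_t
serialized_scenario(void)
{
    OidState shared = { 4294501u, 0 };

    nextoid_log_len = 0;
    for (int i = 0; i < 3 * VAR_OID_PREFETCH; i++)
        get_new_oid(&shared);
    return count_duplicate_records();
}

/* The second caller works from a copy taken before the first caller's updates. */
static size_t
stale_view_scenario(void)
{
    OidState shared = { 4294501u, 0 };
    OidState stale = shared;

    nextoid_log_len = 0;
    get_new_oid(&shared);   /* logs 4294501 + 8192 */
    get_new_oid(&stale);    /* logs the same value again */
    return count_duplicate_records();
}
```

In this model serialized_scenario() finds no duplicates, while
stale_view_scenario() logs the same NEXTOID value twice, which is the
pattern in the WAL excerpt. It says nothing about which of the two
failure modes is actually happening.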

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
