Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)
Date: 2016-05-10 21:15:56
Message-ID: 20160510211556.rumt74jrhqhsaxqx@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-05-10 13:17:52 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 9:19 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> >> On Tue, May 10, 2016 at 3:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> > The easy way to trigger this problem would be to have an oid wraparound
> >> > - but the WAL shows that that's not the case here. I've not figured
> >> > that one out entirely (and won't tonight). But I do see WAL records
> >> > like:
> >> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> >> > rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>
> Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
> records, between those two records you show?

Yes, check http://www.postgresql.org/message-id/20160510210013.2akn4iee7gl4ycen@alap3.anarazel.de

I think the explanation about how the bug is occurring there makes sense.
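
For anyone who wants to scan a WAL dump for the pattern quoted above (the same NEXTOID value logged at two different LSNs), here is a rough sketch. It assumes text output in the format of the two records shown earlier in this thread; the function name repeated_nextoid and the regular expression are made up for illustration and are not part of any PostgreSQL tool.

```python
import re
from collections import defaultdict

# Matches records shaped like the ones quoted in this thread, e.g.
# "... lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693"
NEXTOID_RE = re.compile(r"lsn: ([0-9A-Fa-f]+/[0-9A-Fa-f]+),.*desc: NEXTOID (\d+)")


def repeated_nextoid(lines):
    """Return {oid: [lsn, ...]} for NEXTOID values logged more than once."""
    seen = defaultdict(list)
    for line in lines:
        match = NEXTOID_RE.search(line)
        if match:
            seen[int(match.group(2))].append(match.group(1))
    # Keep only the OIDs that appear in at least two records.
    return {oid: lsns for oid, lsns in seen.items() if len(lsns) > 1}


if __name__ == "__main__":
    sample = [
        "rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693",
        "rmgr: XLOG len (rec/tot): 4/ 30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693",
    ]
    print(repeated_nextoid(sample))
```

A repeated value on its own does not prove anything; it only points at the stretch of WAL worth reading by hand.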

> My current test harness updates the scalar count field on every
> iteration, but changes the (probably toasted) text_array field with a
> probability of only 1% each time. Perhaps making that more likely (by
> changing line 186 of count.pl) would make it easier to trigger the
> bug. I'll try that in my next iteration of tests.
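
To make the workload described above concrete, here is a minimal sketch of the per-iteration choice, written in Python rather than taken from count.pl. The column names come from the description above; the function names, the seed handling, and the default of 0.01 are assumptions for illustration only.

```python
import random


def choose_columns(rng, text_array_prob=0.01):
    """Pick the columns one iteration updates.

    The scalar count is updated every time; the (probably toasted)
    text_array only with probability text_array_prob.
    """
    columns = ["count"]
    if rng.random() < text_array_prob:
        columns.append("text_array")
    return columns


def toast_update_fraction(iterations, text_array_prob, seed=0):
    """Fraction of iterations that would also rewrite text_array."""
    rng = random.Random(seed)
    hits = sum(
        "text_array" in choose_columns(rng, text_array_prob)
        for _ in range(iterations)
    )
    return hits / iterations


if __name__ == "__main__":
    # Compare the 1% setting with a hypothetical, more aggressive 50%.
    print(toast_update_fraction(100_000, 0.01))
    print(toast_update_fraction(100_000, 0.50))
```

Raising text_array_prob increases how often new toast entries (and hence new OIDs) are needed, which is the knob being discussed.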

So my current theory about why the whole thing is kinda hard to
reproduce is that "luck" determines how aggressively the toast table is
vacuumed, and how often it actually succeeds in being vacuumed. You also
need a good bit of bad luck for the hint bits set by GetNewOidWithIndex() to
not survive, given that shared_buffers is pretty small *and* checksums
are enabled.

I guess testing with a bigger shared_buffers setting and without checksums will
make it easier to hit the bug.

Regards,

Andres
