Re: False "pg_serial": apparent wraparound” in logs

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: False "pg_serial": apparent wraparound” in logs
Date: 2023-10-01 18:43:21
Message-ID: 8c9567a5-13cc-cefa-acec-228ae1db23dc@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30/09/2023 02:16, Imseih (AWS), Sami wrote:
> The initial idea was to advance the latest_page_number
> during SerialSetActiveSerXmin, but the initial approach is
> obviously wrong.

That approach at high level could work, a

> When SerialSetActiveSerXmin is called for a new active
> serializable xmin, and at that point we don't need to keep any
> any earlier transactions, should SimpleLruZeroPage be called
> to ensure there is a target page for the xid?
>
> I tried something like below, which fixes my repro, by calling
> SimpleLruZeroPage at the end of SerialSetActiveSerXmin.
>
> @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid)
> static void
> SerialSetActiveSerXmin(TransactionId xid)
> {
> + int targetPage = SerialPage(xid);
> +
> LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE);
>
> /*
> @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid)
>
> serialControl->tailXid = xid;
>
> + if (serialControl->headPage != targetPage)
> + SimpleLruZeroPage(SerialSlruCtl, targetPage);
> +
> LWLockRelease(SerialSLRULock);
> }

No, that's very wrong too. You are zeroing the page containing the
oldest XID that's still needed. That page still contains important
information. It might work if you zero the previous page, but I think
you need to do a little more than that. (I wish we had tests that would
catch that.)

The crux of the problem is that 'tailXid' can advance past 'headXid'. I
was bit surprised by that, but I think it's by design. I wish it was
called out explicitly in a comment though. The code mostly handles that
fine, except that it confuses the "apparent wraparound" check.

'tailXid' is the oldest XID that we might still need to look up in the
SLRU, based on the transactions that are still active, and 'headXid' is
the newest XID that has been written out to the SLRU. But we only write
an XID out to the SLRU and advance headXid if the shared memory data
structure fills up. So it's possible that as old transactions age out,
we advance 'tailXid' past 'headXid'.

SerialAdd() tolerates tailXid > headXid. It will zero out all the pages
between the old headXid and tailXid, even though no lookups can occur on
those pages. That's unnecessary but harmless.

I think the smallest fix here would be to change CheckPointPredicate()
so that if tailPage > headPage, pass headPage to SimpleLruTruncate()
instead of tailPage. Or perhaps it should go into the "The SLRU is no
longer needed" codepath in that case. If tailPage > headPage, the SLRU
isn't needed at the moment.

In addition to that, we could change SerialAdd() to not zero out the
pages between old headXid and tailXid unnecessarily, but that's more of
an optimization than bug fix.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-10-01 18:46:02 Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
Previous Message Peter Eisentraut 2023-10-01 18:41:53 Re: Evaluate arguments of correlated SubPlans in the referencing ExprState