Re: False "pg_serial": apparent wraparound” in logs

From: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: False "pg_serial": apparent wraparound” in logs
Date: 2023-08-23 20:49:29
Message-ID: 5991E478-2DE4-4CE0-98D6-62F090F3505B@amazon.com
Lists: pgsql-hackers

Hi,

I dug into this a bit. What appears to be happening is that the page
containing the latest cutoff xid can be falsely reported as being in
the future of the latest page number, because latest_page_number of
the Serial SLRU is only set when a page is initialized [1].

So under the right conditions, such as in the repro, the serializable
XID has moved past the latest page number. At the next checkpoint,
which calls CheckPointPredicate, it therefore appears that the SLRU
has wrapped around.
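
To make the failure mode concrete, here is a minimal toy model in C. It is a sketch under stated assumptions, not the actual slru.c or predicate.c code: ToySlru, toy_page_precedes and toy_truncate_refused are hypothetical names, and page precedence is reduced to a plain integer comparison, whereas the real SLRU comparison has to account for page numbers wrapping around.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the SLRU shared state. */
typedef struct
{
    int latest_page_number; /* only advanced when a page is initialized */
} ToySlru;

/*
 * Simplified precedence check (assumption): plain integer comparison.
 * The real comparison handles wraparound of page numbers.
 */
static bool
toy_page_precedes(int a, int b)
{
    return a < b;
}

/*
 * Models the truncation guard: if the latest known page appears to
 * precede the cutoff page, truncation is refused and reported as an
 * "apparent wraparound".
 */
static bool
toy_truncate_refused(const ToySlru *slru, int cutoff_page)
{
    return toy_page_precedes(slru->latest_page_number, cutoff_page);
}
```

With latest_page_number left at a stale value of 3 and a cutoff page of 5, toy_truncate_refused() returns true even though nothing has wrapped around. Once latest_page_number is advanced to the cutoff page, it returns false.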

It seems that what may be needed here is to advance
latest_page_number in SerialSetActiveSerXmin, but only if
we are using the SLRU. See below:

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 1af41213b4..6946ed21b4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -992,6 +992,9 @@ SerialSetActiveSerXmin(TransactionId xid)

     serialControl->tailXid = xid;

+    if (serialControl->headPage > 0)
+        SerialSlruCtl->shared->latest_page_number = SerialPage(xid);
+
     LWLockRelease(SerialSLRULock);
 }

[1] https://github.com/postgres/postgres/blob/master/src/backend/access/transam/slru.c#L306

Regards,

Sami

From: "Imseih (AWS), Sami" <simseih(at)amazon(dot)com>
Date: Tuesday, August 22, 2023 at 7:56 PM
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: False "pg_serial": apparent wraparound” in logs

Hi,

I recently encountered a situation in the field in which the message
“could not truncate directory "pg_serial": apparent wraparound”
was logged even though there was no danger of wraparound. This
was on a brand new cluster, and it took only a few minutes for
the message to appear in the logs.

Reading up on the history of this error message, it appears that
work was done to improve SLRU truncation and the associated wraparound
log messages [1]. The attached repro shows that, on master, this message
can still be logged incorrectly.

The repro runs updates with 90 threads in serializable mode and kicks
off a “long running” select on the same table in serializable mode.

As soon as the long running select commits, the next checkpoint fails
to truncate the SLRU and logs the error message.

Besides the confusing log message, there may also be a risk of
pg_serial becoming unnecessarily bloated and depleting disk space.

Is this a bug?

[1] https://www.postgresql.org/message-id/flat/20190202083822.GC32531%40gust.leadboat.com

Regards,

Sami Imseih
Amazon Web Services (AWS)
