xlog.c: removing ReadRecPtr and EndRecPtr

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: xlog.c: removing ReadRecPtr and EndRecPtr
Date: 2021-11-17 22:01:36
Message-ID: CA+Tgmoao96EuNeSPd+hspRKcsCddu=b1h-QNRuKfY8VmfNQdfg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I spent a lot of time trying to figure out why xlog.c has global
variables ReadRecPtr and EndRecPtr instead of just relying on the
eponymous structure members inside the XLogReaderState. I concluded
that the values are the same at most points in the code, and thus that
we could just use xlogreaderstate->{Read,End}RecPtr instead. There are
two places where this wouldn't produce the same results we're getting
today. Both of those appear to be bugs.

The reason why it's generally the case that ReadRecPtr ==
xlogreaderstate->ReadRecPtr and likewise for EndRecPtr is that
ReadRecord() calls XLogReadRecord(), which sets the values inside the
XLogReaderState, and then immediately assigns the values stored there
to the global variables. There's no other code that changes the other
global variables, and the only other code that changes the structure
members is XLogBeginRead(). So the values can be unequal from the time
XLogBeginRead() is called up until the time that the XLogReadRecord()
call inside ReadRecord() returns. In practice, StartupXLOG() always
calls ReadRecord() right after it calls XLogBeginRead(), and
ReadRecord() does not reference either global variable before calling
XLogReadRecord(), so the problem surface is limited to code that runs
underneath XLogReadRecord(). XLogReadRecord() is part of xlogreader.c,
but it uses a callback interface: the callback is XLogPageRead(),
which itself references EndRecPtr, and also calls
WaitForWALToBecomeAvailable(), which in turn calls
rescanLatestTimeLine(), which also references EndRecPtr. So these are
the two problem cases: XLogPageRead(), and rescanLatestTimeLine().

In rescanLatestTimeLine(), the problem is IMHO probably serious enough
to justify a separate commit with back-patching. The problem is that
EndRecPtr is being used here to reject impermissible attempts to
switch to a bad timeline, but if pg_wal starts out empty, EndRecPtr
will be 0 here, which causes the code to fail to detect a case that
should be prohibited. Consider the following test case:

- create a primary
- create standby #1 from the primary
- start standby #1 and promote it
- take a backup from the primary using -Xnone to create standby #2
- clear primary_conninfo on standby #2 and then start it
- copy 00000002.history from standby #1 to standby #2

You get:

2021-11-17 15:34:26.213 EST [7474] LOG: selected new timeline ID: 2

But with the attached patch, you get:

2021-11-17 16:12:01.566 EST [20900] LOG: new timeline 2 forked off
current database system timeline 1 before current recovery point
0/A000060

Had the problem occurred at some later point in the WAL stream rather
than before fetching the very first record, I think everything is
fine; at that point, I think that the global variable EndRecPtr will
be initialized. I'm not entirely sure that it contains exactly the
right value, but it's someplace in the right ballpark, at least.

In XLogPageRead(), the problem is just cosmetic. We're only using
EndRecPtr as an argument to emode_for_corrupt_record(), which is all
about suppressing duplicate complaints about the same LSN. But if the
xlogreader has been repositioned using XLogBeginRead() since the last
call to ReadRecord(), or if there are no preceding calls to
ReadRecord(), then the value of EndRecPtr here is left over from the
previous read position and is not particularly related to the record
we're reading now. xlogreader->EndRecPtr, OTOH, is. This doesn't seem
worth a separate commit to me, or a back-patch, but it seems worth
fixing while I'm cleaning up these global variables.

Review appreciated.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v1-0001-xlog.c-Remove-global-variables-ReadRecPtr-and-End.patch application/octet-stream 10.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-11-17 22:01:37 Re: Windows: Wrong error message at connection termination
Previous Message Tom Lane 2021-11-17 21:54:43 Re: Add planner support function for starts_with()