ThisTimeLineID can be used uninitialized

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <anarazel(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: ThisTimeLineID can be used uninitialized
Date: 2021-10-19 19:13:04
Message-ID: CA+TgmoZC2aYe9eRXcUa_c_XBBMVLps9LZL_K5Oo7M9Xhco2KhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

This is a followup to
http://postgr.es/m/CA+TgmoZ5A26C6OxKApafyuy_sx0VG6VXdD_Q6aSEzsvrPHDwzw@mail.gmail.com.
I'm suspicious of the following code in CreateReplicationSlot:

/* setup state for WalSndSegmentOpen */
sendTimeLineIsHistoric = false;
sendTimeLine = ThisTimeLineID;

The first thing that's odd about this is that if this is physical
replication, it's apparently dead code, because AFAICT sendTimeLine
will not be used for anything in that case. If it's logical
replication, there's a case where this will set sendTImeLine to 0,
which seems very strange, since that is not a valid timeline. To
reproduce, do this:

1. Create a new primary database with tli=1. Create a standby.

2. On the standby, fire up a database-connected replication, something
like this:
psql 'port=5433 replication=database dbname=rhaas'
Don't execute any commands yet!

2. From some other backend, promote the standby:
select pg_promote();
It now gets a TLI of 2.

3. Try to create a logical replication slot perhaps using something like this:
CREATE_REPLICATION_SLOT "foo" LOGICAL "test_decoding" ( SNAPSHOT 'nothing');

If the system had been in normal running when you started the session,
it would be initialized, because InitPostgres() calls
RecoveryInProgress(). But since that only initializes it during normal
running, and not during recovery, it doesn't help in this scenario.
And if on the other hand you had not promoted the standby as in step
2, then we'd still set sendTimeLine = 0 here, but then we'd almost
immediately call CheckLogicalDecodingRequirements() and error out
without doing anything with the value. Here, however, we continue on.

But I don't know if it matters. We call CreateInitDecodingContext()
with sendTimeLine and ThisTimeLineID still zero; it doesn't call any
callbacks. Then we call DecodingContextFindStartpoint() with
sendTimeLine still 0 and the first callback that gets invoked is
logical_read_xlog_page. At this point sendTimeLine = 0 and
ThisTimeLineID = 0. That calls XLogReadDetermineTimeline() which
resets ThisTimeLineID to the correct value of 2, but when we get back
to logical_read_xlog_page, we still manage to call WALRead with a
timeline of 0 because state->seg.ws_tli is still 0. And when WALRead
eventually does call WalSndOpen, which unconditionally propagates
sendTimeLine into the TLI pointer that is passed to it. So now
state->seg_ws_tli also ends up being 2. So I guess maybe nothing bad
happens? But it sure seems strange that the code would apparently work
just as well as it does today with the following patch:

diff --git a/src/backend/replication/walsender.c
b/src/backend/replication/walsender.c
index b811a5c0ef..44fd598519 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -945,7 +945,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)

/* setup state for WalSndSegmentOpen */
sendTimeLineIsHistoric = false;
- sendTimeLine = ThisTimeLineID;
+ sendTimeLine = rand() % 10;

if (cmd->kind == REPLICATION_KIND_PHYSICAL)
{

And in fact, that passes make check-world. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-10-19 19:22:15 Re: [RFC] building postgres with meson
Previous Message Andrew Dunstan 2021-10-19 18:54:58 Re: Postgres perl module namespace