Quick Links

BUG: Former primary node might stuck when started as a standby

From:	Alexander Lakhin <exclusion(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	BUG: Former primary node might stuck when started as a standby
Date:	2024-01-18 18:00:01
Message-ID:	b0102688-6d6c-c86a-db79-e0e91d245b1a@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node, that acted as a primary, becomes a standby as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not go online, due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached.
I put it in src/test/recovery/t, run as follows:
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break;
done
and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr 0.00 sys + 4.20 cusr 4.75 csys = 8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in the _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0: - Transaction/COMMIT: 2024-01-18
15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18
15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18
15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18
15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG: server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG: forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608,
'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40: - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG: postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG: received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68: - Transaction/COMMIT: 2024-01-18
15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG: database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 -
Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0:
rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 -
Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG: fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG: started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL: End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG: new timeline 20 forked off current database system timeline 19 before current
recovery point 0/304DC68

In this case, node1 wrote to it's WAL record 0/304DC68, but sent to node2
only record 0/304DBF0, then node2, being promoted to primary, forked a next
timeline from it, but when node1 was started as a standby, it first
replayed 0/304DC68 from WAL, and then could not switch to the new timeline
starting from the previous position.

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander

Attachment	Content-Type	Size
099_change_roles.pl	application/x-perl	2.0 KB

Responses

Re: BUG: Former primary node might stuck when started as a standby at 2024-01-19 11:45:02 from Aleksander Alekseev

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Sergey Prokhorenko	2024-01-18 18:28:10	Re: UUID v7
Previous Message	Andy Fan	2024-01-18 17:51:56	Re: the s_lock_stuck on perform_spin_delay