BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: BUG: Former primary node might stuck when started as a standby
Date: 2024-01-18 18:00:01
Message-ID: b0102688-6d6c-c86a-db79-e0e91d245b1a@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node, that acted as a primary, becomes a standby as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not go online, due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached.
I put it in src/test/recovery/t, run as follows:
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break;
done
and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr  0.00 sys +  4.20 cusr  4.75 csys =  8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in the _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0:  - Transaction/COMMIT: 2024-01-18
15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG:  xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18
15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18
15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18
15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG:  server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG:  forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608,
'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40:  - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG:  postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG:  received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68:  - Transaction/COMMIT: 2024-01-18
15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG:  xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background
activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG:  database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 -
Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0:
rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 -
Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG:  fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG:  started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL:  End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG:  new timeline 20 forked off current database system timeline 19 before current
recovery point 0/304DC68

In this case, node1 wrote to it's WAL record 0/304DC68, but sent to node2
only record 0/304DBF0, then node2, being promoted to primary, forked a next
timeline from it, but when node1 was started as a standby, it first
replayed 0/304DC68 from WAL, and then could not switch to the new timeline
starting from the previous position.

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander

Attachment Content-Type Size
099_change_roles.pl application/x-perl 2.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sergey Prokhorenko 2024-01-18 18:28:10 Re: UUID v7
Previous Message Andy Fan 2024-01-18 17:51:56 Re: the s_lock_stuck on perform_spin_delay