BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding.

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: max(dot)evtushok(at)gmail(dot)com
Subject: BUG #16517: Fast Shutdown sometimes is not working if we use logical decoding.
Date: 2020-06-29 13:31:22
Message-ID: 16517-db20030710e402fa@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16517
Logged by: Евтушок Максим
Email address: max(dot)evtushok(at)gmail(dot)com
PostgreSQL version: 11.8
Operating system: Linux
Description:

I use a logical decoding feature (pgoutput plugin on the sender's side and
pgjdbc with Debezium on the receiver's side).

After insert + commit I wait for some time, watching in Postgres debug log
for values of write+flush+apply LSN positions (walsender.c).

Write position of walsender continues to grow because new data arrives into
WAL. But flush and apply positions freeze because no new data transferred to
the receiver.

It blocks fast shutdown of Postgres because there is a check that the write
position should be equal to the flush position in WalSndDone procedure
(https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walsender.c#L2852)

I think that the write position should not be incremented if no data sent to
a receiver.

Example:

Postgres log some time after insert:
2020-06-26 10:28:26.526 UTC [767983] PostgreSQL JDBC Driver
185.193.196.40(2838) DEBUG: write 0/1B1959E0 flush 0/1B1958C8 apply
0/1B1958C8

Waldump:
/usr/local/pgsql/data$ pg_waldump -s 0/1B1935C8
rmgr: Sequence len (rec/tot): 99/ 99, tx: 752659, lsn:
0/1B1935C8, prev 0/1B193590, desc: LOG rel 1663/16384/32801, blkref #0: rel
1663/16384/32801 blk 0
rmgr: Heap len (rec/tot): 75/ 583, tx: 752659, lsn:
0/1B193630, prev 0/1B1935C8, desc: INSERT off 11, blkref #0: rel
1663/16384/32803 blk 2119 FPW
rmgr: Btree len (rec/tot): 53/ 8193, tx: 752659, lsn:
0/1B193878, prev 0/1B193630, desc: INSERT_LEAF off 405, blkref #0: rel
1663/16384/32807 blk 1076 FPW
rmgr: Transaction len (rec/tot): 46/ 46, tx: 752659, lsn:
0/1B195898, prev 0/1B193878, desc: COMMIT 2020-06-26 10:23:59.694483 UTC
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
0/1B1958C8, prev 0/1B195898, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
0/1B195900, prev 0/1B1958C8, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn:
0/1B195938, prev 0/1B195900, desc: CHECKPOINT_ONLINE redo 0/1B195900; tli 1;
prev tli 1; fpw true; xid 0:752660; oid 32811; multi 1; offset 0; oldest xid
561 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid:
0/0; oldest running xid 752660; online
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
0/1B1959A8, prev 0/1B195938, desc: RUNNING_XACTS nextXid 752660
latestCompletedXid 752659 oldestRunningXid 752660
pg_waldump: FATAL: error in WAL record at 0/1B1959A8: invalid record
length at 0/1B1959E0: wanted 24, got 0

Last message on network with WAL data is a commit message (XLogData (B)
https://www.postgresql.org/docs/11/protocol-replication.html):
0040 64 00 00 00 37 77 00 00 00 00 1b 19 58 c8
d...7w......XÈ
0050 00 00 00 00 1b 19 58 c8 00 02 4b f8 e0 40 2d ab
(dot)(dot)(dot)(dot)(dot)(dot)XÈ(dot)(dot)Køà(at)-«
0060 43 00 00 00 00 00 1b 19 58 98 00 00 00 00 1b 19
C.......X.......
0070 58 c8 00 02 4b f8 e0 40 26 93 XÈ..Køà@&.
Write position here - 0/1B1958C8

No new XLogData arrived, but keepalive LSN keep incrementing after the last
commit message:
0040 64 00 00 00 16 6b 00 00 00 00 1b 19 59 e0
d....k......Yà
0050 00 02 4b f8 ef 22 6c d6 00 ..Køï"lÖ.
Write position here - 0/1B1959E0

So there are increments of LSN not visible to receiver that couldn't be
flushed.

Because of that there is big probability that during fast shutdown keep
alive will have bigger LSN in keep-alive than flushed LSN on a receiver and
fast shutdown won't be possible.

Browse pgsql-bugs by date

  From Date Subject
Next Message Geoffrey Gordon Ashbrook 2020-06-29 16:49:03 fedora32 install guide error: "initdb" --> "--initdb"
Previous Message PG Bug reporting form 2020-06-29 09:13:52 BUG #16516: when testing jit get terminate called after throwing an instance of 'std::bad_function_call'