BUG #16764: wal sender exits abnormally which cause wal receiever exits twice

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: bchen90(at)163(dot)com
Subject: BUG #16764: wal sender exits abnormally which cause wal receiever exits twice
Date: 2020-12-06 10:12:16
Message-ID: 16764-bfe6bcd175183075@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16764
Logged by: Bo Chen
Email address: bchen90(at)163(dot)com
PostgreSQL version: 11.8
Operating system: Suse linux 11.3 x86_64
Description:

Hi,hackers

For hot standby mode, I constructed the following operations,kills the
walsender process while inserting some data to the test table. I found the
walreceiver process exits twice abnormally , obviously first exiting is
because the exit of the sender,but when the wal replication stream
reestablished,wal walreceiver killed by startup because invaild xlog
record.

I analysis the problem, it is found that the xlog record of the problem
location(across page record) changes before and after the sender exits, and
the startup process of the standby has already loaded the previous page of
the changing record into memory before the exit of the receiver. As a
result, invalid xlog record is founded and the receiver process killed by
strartup.

ps: I test 11.5 and the problem is still exists, I believe this problem has
been around for a long time.

Is tihis problem normally?

I'm worried about this problem could cause standby data damage in extreme
scenarios(for the same contrecord length and crc, but not the same xlog
record)

insert data procedure:
create or replace procedure test_procedure(in_count in integer)
as
$$
declare
i_count integer;
begin
FOR i_count IN 1..in_count LOOP
INSERT INTO test_table
VALUES(i_count,'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');
end loop;
delete from test_table;
end;
$$ language plpgsql;

standby database runing log:

2020-11-27 12:14:13.288 CGP [38116] DEBUG: sendtime 2020-11-27
12:14:13.287768+00 receipttime 2020-11-27 12:14:13.288081+00 replication
apply delay 26585 ms transfer latency 0 ms
2020-11-27 12:14:13.288 CGP [38116] DEBUG: sending write 1/B1FB8E0 flush
1/B1F4FA0 apply 1/B1F4EF8
2020-11-27 12:14:13.288 CGP [38116] DEBUG: sending write 1/B1FB8E0 flush
1/B1FB8E0 apply 1/B1F4EF8
2020-11-27 12:14:13.289 CGP [7804] LOG: invalid contrecord length 152 at
1/B1F4EF8
2020-11-27 12:14:13.289 CGP [38116] FATAL: terminating walreceiver process
due to administrator command
2020-11-27 12:14:13.289 CGP [38116] DEBUG: shmem_exit(1): 1
before_shmem_exit callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG: shmem_exit(1): 5 on_shmem_exit
callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG: proc_exit(1): 2 callbacks to
make
2020-11-27 12:14:13.289 CGP [38116] DEBUG: exit(1)
2020-11-27 12:14:13.289 CGP [38116] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2020-11-27 12:14:13.289 CGP [38116] DEBUG: proc_exit(-1): 0 callbacks to
make
2020-11-27 12:14:13.289 CGP [7804] DEBUG: switched WAL source from stream
to archive after failure
2020-11-27 12:14:13.289 CGP [7804] DEBUG: transaction ID wrap limit is
2147484208, limited by database with OID 1
2020-11-27 12:14:13.289 CGP [7804] CONTEXT: WAL redo at 1/B1F4EF8 for
XLOG/CHECKPOINT_SHUTDOWN: redo 1/B1F4EF8; tli 1; prev tli 1; fpw true; xid
0:99176; oid 24585; multi 1; offset 0; oldest xid 561 in DB 1; oldest multi
1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown

Browse pgsql-bugs by date

  From Date Subject
Next Message Justin Pryzby 2020-12-06 18:02:48 pg_upgrade test for binary compatibility of core data types
Previous Message Tom Lane 2020-12-05 21:17:06 Re: Small query yields "error: no relation entry for relid 11"