BUG #16125: Crash of PostgreSQL's wal sender during logical replication

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: andrey(dot)salnikov(at)dataegret(dot)com
Subject: BUG #16125: Crash of PostgreSQL's wal sender during logical replication
Date: 2019-11-18 18:41:43
Message-ID: 16125-a78dafa032d53f7e@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16125
Logged by: Andrey Salnikov
Email address: andrey(dot)salnikov(at)dataegret(dot)com
PostgreSQL version: 10.11
Operating system: Ubuntu 18.04.2 LTS
Description:

Hi,
PostgreSQL's WAL sender has begun crashing continuously during logical
replication. After each crash, the restarted WAL sender tries to decode the
same part of the WAL and crashes again and again.

The log contains the following lines:
2019-11-16 03:01:39.512 UTC 22235 @ from [vxid: txid:0] [] LOG: server
process (PID 26972) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:16.240 UTC 8286 @ from [vxid: txid:0] [] LOG: checkpoint
complete: wrote 675790 buffers (32.2%); 0 WAL file(s) added, 198 removed, 0
recycled; write=68.768 s, sync=0.011 s, total=69.214 s; sync files=361,
longest=0.002 s, average=0.000 s; distance=3164435 kB, estimate=3164435 kB
2019-11-16 03:03:16.251 UTC 22235 @ from [vxid: txid:0] [] LOG: database
system is ready to accept connections
2019-11-16 03:03:16.533 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG: received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:16.533 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG: received replication command: START_REPLICATION SLOT "rep_slot"
LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:16.534 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG: starting logical decoding for slot "rep_slot"
2019-11-16 03:03:16.534 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] DETAIL: streaming transactions committing after 25EE/D614FC78,
reading WAL from 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG: logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr(at)db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] DETAIL: There are no running transactions.
2019-11-16 03:03:16.732 UTC 22235 @ from [vxid: txid:0] [] LOG: server
process (PID 10070) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:18.410 UTC 10072 @ from [vxid: txid:0] [] LOG: database
system was not properly shut down; automatic recovery in progress
2019-11-16 03:03:18.413 UTC 10072 @ from [vxid: txid:0] [] LOG: invalid
record length at 25EE/D6E155C8: wanted 24, got 0
2019-11-16 03:03:18.413 UTC 10072 @ from [vxid: txid:0] [] LOG: redo is
not required
2019-11-16 03:03:18.419 UTC 10072 @ from [vxid: txid:0] [] LOG: checkpoint
starting: end-of-recovery immediate
2019-11-16 03:03:18.467 UTC 10072 @ from [vxid: txid:0] [] LOG: checkpoint
complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
recycled; write=0.028 s, sync=0.000 s, total=0.050 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2019-11-16 03:03:18.476 UTC 22235 @ from [vxid: txid:0] [] LOG: database
system is ready to accept connections
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG: received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG: received replication command: START_REPLICATION SLOT "rep_slot"
LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG: starting logical decoding for slot "rep_slot"
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] DETAIL: streaming transactions committing after 25EE/D614FC78,
reading WAL from 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG: logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr(at)db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] DETAIL: There are no running transactions.
2019-11-16 03:03:18.949 UTC 22235 @ from [vxid: txid:0] [] LOG: server
process (PID 10112) was terminated by signal 11: Segmentation fault

The core dump contains:
Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/92/b3b790a5135245f2cbb2a0a2f48fc7ff9f689a.debug...done.
done.
[New LWP 13678]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 10/main: wal sender process replication
1.1.1.1(43224) idle '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 heap_deform_tuple (tuple=tuple(at)entry=0x8,
tupleDesc=tupleDesc(at)entry=0x555713b2e638,
values=values(at)entry=0x555713b7ca48, isnull=isnull(at)entry=0x555713b7cf98 "")
at ./build/../src/backend/access/common/heaptuple.c:936
936 ./build/../src/backend/access/common/heaptuple.c: No such file or
directory.
(gdb) bt
#0 heap_deform_tuple (tuple=tuple(at)entry=0x8,
tupleDesc=tupleDesc(at)entry=0x555713b2e638,
values=values(at)entry=0x555713b7ca48, isnull=isnull(at)entry=0x555713b7cf98 "")
at ./build/../src/backend/access/common/heaptuple.c:936
#1 0x00005557127c918b in ReorderBufferToastReplace (rb=0x555713b7b788,
txn=0x555713b982e8, relation=0x7f4b3fafa248, relation=0x7f4b3fafa248,
change=0x555713b9a108)
at ./build/../src/backend/replication/logical/reorderbuffer.c:2985
#2 ReorderBufferCommit (rb=0x555713b7b788, xid=xid(at)entry=1667601527,
commit_lsn=commit_lsn(at)entry=41707737345720,
end_lsn=end_lsn(at)entry=41707737345792,
commit_time=commit_time(at)entry=627188499326201,
origin_id=origin_id(at)entry=0, origin_lsn=0) at
./build/../src/backend/replication/logical/reorderbuffer.c:1585
#3 0x00005557127bef32 in DecodeCommit (xid=1667601527,
parsed=0x7ffd288fe4a0, buf=<synthetic pointer>, ctx=0x555713a9ca08) at
./build/../src/backend/replication/logical/decode.c:614
#4 DecodeXactOp (buf=<synthetic pointer>, ctx=0x555713a9ca08) at
./build/../src/backend/replication/logical/decode.c:244
#5 LogicalDecodingProcessRecord (ctx=0x555713a9ca08, record=<optimized
out>) at ./build/../src/backend/replication/logical/decode.c:113
#6 0x00005557127d0b86 in XLogSendLogical () at
./build/../src/backend/replication/walsender.c:2797
#7 0x00005557127d232f in WalSndLoop
(send_data=send_data(at)entry=0x5557127d0b10 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2157
#8 0x00005557127d30aa in StartLogicalReplication (cmd=0x555713b499d8) at
./build/../src/backend/replication/walsender.c:1105
#9 exec_replication_command (cmd_string=cmd_string(at)entry=0x555713b4da38
"START_REPLICATION SLOT \"pgprod_office_basic\" LOGICAL 25EE/D6DD72D0
(proto_version '1', publication_names '\"pgprod_master_basic\"')")
at ./build/../src/backend/replication/walsender.c:1536
#10 0x000055571281de3a in PostgresMain (argc=<optimized out>,
argv=argv(at)entry=0x555713add8d8, dbname=<optimized out>, username=<optimized
out>) at ./build/../src/backend/tcop/postgres.c:4113
#11 0x000055571255c72c in BackendRun (port=0x555713ad56d0) at
./build/../src/backend/postmaster/postmaster.c:4408
#12 BackendStartup (port=0x555713ad56d0) at
./build/../src/backend/postmaster/postmaster.c:4080
#13 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1756
#14 0x00005557127ada21 in PostmasterMain (argc=9, argv=0x555713a7d190) at
./build/../src/backend/postmaster/postmaster.c:1364
#15 0x000055571255e422 in main (argc=9, argv=0x555713a7d190) at
./build/../src/backend/main/main.c:228
(gdb) quit
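
One observation, offered with all due caution: the faulting pointer
tuple=0x8 in frame #0 would be consistent with ReorderBufferToastReplace()
passing &newtup->tuple while change->data.tp.newtuple is NULL, since in the
PostgreSQL 10 sources the HeapTupleData member sits 8 bytes into
ReorderBufferTupleBuf on amd64. Below is a minimal, standalone sketch with
simplified stand-in structs (not the real PostgreSQL definitions) that only
illustrates that pointer arithmetic; it is an assumption about what the core
dump shows, not a root-cause analysis.

/* toast_null_sketch.c -- illustrative only; simplified stand-ins for the
 * PostgreSQL 10 structs (assumption: slist_node is a single pointer and
 * HeapTupleData is the second member of ReorderBufferTupleBuf). */
#include <stdio.h>
#include <stddef.h>

typedef struct slist_node
{
    struct slist_node *next;
} slist_node;

typedef struct HeapTupleData
{
    unsigned int t_len;          /* remaining fields omitted for the sketch */
} HeapTupleData;

typedef struct ReorderBufferTupleBuf
{
    slist_node    node;          /* one pointer: 8 bytes on amd64 */
    HeapTupleData tuple;         /* therefore starts at offset 8 */
    size_t        alloc_tuple_size;
} ReorderBufferTupleBuf;

int
main(void)
{
    /* If change->data.tp.newtuple were NULL, &newtup->tuple would be
     * NULL plus this offset, i.e. 0x8 -- matching the faulting address
     * that heap_deform_tuple() received in frame #0. */
    printf("offsetof(ReorderBufferTupleBuf, tuple) = %zu\n",
           offsetof(ReorderBufferTupleBuf, tuple));
    return 0;
}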

This behaviour does not depend on the data in the tables, because we see it
in different databases with different sets of tables in the publications.
It looks like a real issue in logical replication.
I will be happy to provide additional information about this issue, but I
need to know what else to collect to help solve this problem.

Regards,
Andrey.
