BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: ienieghapheoghaiwida(at)xff(dot)cz
Subject: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 01:14:18
Message-ID: 16129-a0c0f48e71741e5f@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16129
Logged by: Ondrej Jirman
Email address: ienieghapheoghaiwida(at)xff(dot)cz
PostgreSQL version: 12.1
Operating system: Arch Linux
Description:

Hello,

I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via pg_dumpall
method and after a while I started getting segfault in logical replication
worker.

My setup is fairly vanilla, non-default options:

shared_buffers = 256MB
work_mem = 512MB
temp_buffers = 64MB
maintenance_work_mem = 4GB
effective_cache_size = 16GB
max_logical_replication_workers = 30
max_replication_slots = 30
max_worker_processes = 30
wal_level = logical

I have several databases that I subscribe to from this database cluster
using logical replication.

Replication of one of my databases (running on ARMv7 machine) started
segfaulting on the subscriber side (x86_64) like this:

#0 0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
/usr/lib/libc.so.6
#1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
__dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
#2 tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
#3 0x000055d033e94d32 in ExecFetchSlotHeapTuple
(slot=slot(at)entry=0x55d0356da3b8, materialize=materialize(at)entry=true,
shouldFree=shouldFree(at)entry=0x7fff0e7cf387) at execTuples.c:1624
#4 0x000055d033d272a1 in heapam_tuple_update (relation=0x7fc24573a228,
otid=0x55d0356dac38, slot=0x55d0356da3b8, cid=0, snapshot=<optimized out>,
crosscheck=0x0, wait=true, tmfd=0x7fff0e7cf400,
lockmode=0x7fff0e7cf3fc, update_indexes=0x7fff0e7cf457) at
heapam_handler.c:325
#5 0x000055d033d537cd in table_tuple_update (update_indexes=0x7fff0e7cf457,
lockmode=0x7fff0e7cf3fc, tmfd=0x7fff0e7cf400, wait=true, crosscheck=0x0,
snapshot=0x0, cid=<optimized out>,
slot=0x55d0356da3b8, otid=0x55d0356dac38, rel=0x7fc24573a228) at
../../../../src/include/access/tableam.h:1261
#6 simple_table_tuple_update (rel=rel(at)entry=0x7fc24573a228,
otid=otid(at)entry=0x55d0356dac38, slot=slot(at)entry=0x55d0356da3b8,
snapshot=0x0, update_indexes=update_indexes(at)entry=0x7fff0e7cf457)
at tableam.c:325
#7 0x000055d033e914c4 in ExecSimpleRelationUpdate
(estate=estate(at)entry=0x55d03572cff0, epqstate=epqstate(at)entry=0x7fff0e7cf4b0,
searchslot=searchslot(at)entry=0x55d0356dac08,
slot=slot(at)entry=0x55d0356da3b8) at execReplication.c:493
#8 0x000055d033f9158e in apply_handle_update (s=s(at)entry=0x7fff0e7d6a80) at
worker.c:750
#9 0x000055d033f91ab0 in apply_dispatch (s=s(at)entry=0x7fff0e7d6a80) at
worker.c:968
#10 0x000055d033f9206e in LogicalRepApplyLoop (last_received=11741462160) at
worker.c:1176
#11 0x000055d033f927c4 in ApplyWorkerMain (main_arg=<optimized out>) at
worker.c:1734
#12 0x000055d033f601d5 in StartBackgroundWorker () at bgworker.c:834
#13 0x000055d033f6e818 in do_start_bgworker (rw=rw(at)entry=0x55d035620780) at
postmaster.c:5770
#14 0x000055d033f6e9a4 in maybe_start_bgworkers () at postmaster.c:5996
#15 0x000055d033f6f216 in sigusr1_handler (postgres_signal_arg=<optimized
out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007fc259787e2a in select () from /usr/lib/libc.so.6
#18 0x000055d033f6fb37 in ServerLoop () at postmaster.c:1668
#19 0x000055d033f70c5a in PostmasterMain (argc=3, argv=<optimized out>) at
postmaster.c:1377
#20 0x000055d033ed9731 in main (argc=3, argv=0x55d0355f3260) at main.c:228

Subscriber:

2019-11-20 23:22:50.315 CET [1480050] LOG: data stream from publisher has
ended
2019-11-20 23:22:50.315 CET [1480050] ERROR: could not send
end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.318 CET [178527] LOG: background worker "logical
replication worker" (PID 1480050) exited with exit code 1
2019-11-20 23:22:50.320 CET [1480051] LOG: data stream from publisher has
ended
2019-11-20 23:22:50.320 CET [1480051] ERROR: could not send
end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.321 CET [178527] LOG: background worker "logical
replication worker" (PID 1480051) exited with exit code 1
2019-11-20 23:22:50.321 CET [2195905] LOG: logical replication apply worker
for subscription "l5_hometv" has started
2019-11-20 23:22:50.325 CET [1480052] LOG: data stream from publisher has
ended
2019-11-20 23:22:50.325 CET [1480052] ERROR: could not send
end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.326 CET [178527] LOG: background worker "logical
replication worker" (PID 1480052) exited with exit code 1
2019-11-20 23:22:50.395 CET [2195905] ERROR: could not connect to the
publisher: FATAL: the database system is shutting down
FATAL: the database system is shutting down
2019-11-20 23:22:50.396 CET [178527] LOG: background worker "logical
replication worker" (PID 2195905) exited with exit code 1
2019-11-20 23:22:55.401 CET [2195919] LOG: logical replication apply worker
for subscription "l5_hometv" has started
2019-11-20 23:22:55.411 CET [2195920] LOG: logical replication apply worker
for subscription "l5_ledger" has started
2019-11-20 23:22:55.421 CET [2195921] LOG: logical replication apply worker
for subscription "l5_news" has started
2019-11-20 23:28:52.488 CET [178527] LOG: background worker "logical
replication worker" (PID 2195919) was terminated by signal 11: Neoprávněný
přístup do pam>
2019-11-20 23:28:52.488 CET [178527] LOG: terminating any other active
server processes
2019-11-20 23:28:52.488 CET [2195470] WARNING: terminating connection
because of crash of another server process
2019-11-20 23:28:52.488 CET [2195470] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another >
2019-11-20 23:28:52.488 CET [2195470] HINT: In a moment you should be able
to reconnect to the database and repeat your command.

Logs for publisher and subscriber follow. I think that the initial segfault
on the subscriber was tirggered by restart of the publisher machine. From
that point forward, subscriber segfaults on each connection to the
publisher. It may or may not be relevant to the upgrade, but looks like the
code in execTuples.c was changed quite a bit in 12.x, so it may.

Publisher:

lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.983 CET [466] LOG:
unexpected EOF on standby connection
lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.983 CET [467] LOG:
unexpected EOF on standby connection
lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.982 CET [466] LOG:
could not receive data from client: Spojení zrušeno druhou stranou
lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.982 CET [467] LOG:
could not receive data from client: Spojení zrušeno druhou stranou
lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.980 CET [465] LOG:
unexpected EOF on standby connection
lis 20 23:29:02 l5 postgres[355]: 2019-11-20 23:29:02.980 CET [465] LOG:
could not receive data from client: Spojení zrušeno druhou stranou
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.588 CET [467] DETAIL:
There are no running transactions.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.588 CET [467] LOG:
logical decoding found consistent point at 2/BBD880E8
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.587 CET [467] DETAIL:
Streaming transactions committing after 2/BBD88120, reading WAL from
2/BBD880E8.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.587 CET [467] LOG:
starting logical decoding for slot "l5_news"
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.579 CET [466] DETAIL:
There are no running transactions.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.579 CET [466] LOG:
logical decoding found consistent point at 2/BBD880E8
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.578 CET [466] DETAIL:
Streaming transactions committing after 2/BBD88120, reading WAL from
2/BBD880E8.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.578 CET [466] LOG:
starting logical decoding for slot "l5_ledger"
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.572 CET [465] DETAIL:
There are no running transactions.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.572 CET [465] LOG:
logical decoding found consistent point at 2/BBD86E68
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.570 CET [465] DETAIL:
Streaming transactions committing after 2/BBD86EA0, reading WAL from
2/BBD86E68.
lis 20 23:28:54 l5 postgres[355]: 2019-11-20 23:28:54.570 CET [465] LOG:
starting logical decoding for slot "l5_hometv"
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.495 CET [385] LOG:
unexpected EOF on standby connection
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.495 CET [385] LOG:
could not receive data from client: Spojení zrušeno druhou stranou
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.491 CET [384] LOG:
unexpected EOF on standby connection
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.491 CET [384] LOG:
could not receive data from client: Spojení zrušeno druhou stranou
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG:
unexpected EOF on standby connection
lis 20 23:28:52 l5 postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG:
could not receive data from client: Spojení zrušeno druhou stranou

lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [386] DETAIL:
There are no running transactions.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [386] LOG:
logical decoding found consistent point at 2/BBD86CE0
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [384] DETAIL:
There are no running transactions.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [384] LOG:
logical decoding found consistent point at 2/BBD86CE0
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [385] DETAIL:
There are no running transactions.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.107 CET [385] LOG:
logical decoding found consistent point at 2/BBD86CE0
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [386] DETAIL:
Streaming transactions committing after 2/BBD86CE0, reading WAL from
2/BBD86CE0.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [386] LOG:
starting logical decoding for slot "l5_hometv"
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [384] DETAIL:
Streaming transactions committing after 2/BBD86CE0, reading WAL from
2/BBD86CE0.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [384] LOG:
starting logical decoding for slot "l5_ledger"
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [385] DETAIL:
Streaming transactions committing after 2/BBD86CE0, reading WAL from
2/BBD86CE0.
lis 20 23:22:58 l5 postgres[355]: 2019-11-20 23:22:58.106 CET [385] LOG:
starting logical decoding for slot "l5_news"
lis 20 23:22:54 l5 systemd[1]: Started PostgreSQL database server.
lis 20 23:22:54 l5 postgres[355]: 2019-11-20 23:22:54.295 CET [355] LOG:
database system is ready to accept connections
lis 20 23:22:54 l5 postgres[355]: 2019-11-20 23:22:54.146 CET [356] LOG:
database system was shut down at 2019-11-20 23:22:50 CET
lis 20 23:22:54 l5 systemd-networkd[326]: eth0: Gained IPv6LL
lis 20 23:22:53 l5 postgres[355]: 2019-11-20 23:22:53.544 CET [355] LOG:
listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
lis 20 23:22:53 l5 postgres[355]: 2019-11-20 23:22:53.539 CET [355] LOG:
listening on IPv6 address "::", port 5432
lis 20 23:22:53 l5 postgres[355]: 2019-11-20 23:22:53.539 CET [355] LOG:
listening on IPv4 address "0.0.0.0", port 5432

thank you and regards,
Ondrej

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2019-11-21 04:45:03 Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Previous Message Tom Lane 2019-11-20 18:16:22 Re: ALTER TABLE fails when changing column type due to index with bit_ops opclass