Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 17:35:55
Message-ID: 20191121173555.midj6vp4dmiiqs24@core.my.home
Lists: pgsql-bugs

On Thu, Nov 21, 2019 at 05:57:52PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 05:15:02PM +0100, Ondřej Jirman wrote:
> > On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
> > >
> > > Maybe it has something to do with my upgrade method. I
> > > dumped/restored the replica with pg_dumpall, and then just proceeded
> > > to enable subscription and refresh publication with (copy_data=false)
> > > for all my subscriptions.
> >
> > OTOH, it may not. There are 2 more databases replicated the same way
> > from the same database cluster, and they don't crash the replica
> > server, and continue replicating. One of the other databases also
> > has bytea columns in some of the tables.
> >
> > It really just seems related to the machine restart (a regular one)
> > that I did on the primary; minutes later the replica crashed, and kept
> > crashing ever since whenever connecting to the primary for the hometv
> > database.
> >
>
> Hmmm. A restart of the primary certainly should not cause any such
> damage, that'd be a bug too. And it'd be a bit strange that it correctly
> sends the data and it crashes the replica. How exactly did you restart
> the primary? What mode - smart/fast/immediate?

Looks like Arch Linux uses fast mode; its service unit sets KillSignal=SIGINT.
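
(For reference, postgres maps SIGTERM to smart, SIGINT to fast and SIGQUIT to
immediate shutdown, so KillSignal=SIGINT should indeed mean a fast shutdown.
Assuming the unit is named postgresql.service as on Arch, what it actually
sends can be double-checked with:

  $ systemctl show postgresql.service -p KillSignal

which should be equivalent to a manual pg_ctl stop -m fast.)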

Primary:
-------

23:22:31 - primary machine booted, so the PostgreSQL shutdown was ~15s prior
23:22:53 - primary cluster started (saying
           LOG: database system was shut down at 2019-11-20 23:22:50 CET)

(there was some RTC inaccuracy, corrected by NTP earlier during boot)
Nov 20 23:22:36 l5 systemd-timesyncd[312]: System clock time unset or jumped backwards, restoring from recorded timestamp: Wed 2019-11-20 23:22:50 CET

postgres[355]: 2019-11-20 23:22:58.106 CET [386] LOG: starting logical decoding for slot "l5_hometv"

And this is when the replica crashed:

postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: could not receive data from client: Spojení zrušeno druhou stranou (= connection reset by peer)
postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: unexpected EOF on standby connection

Replica:
-------

2019-11-20 15:35:32.910 CET [1480049] LOG: logical replication apply worker for subscription "l3_weatherlog" has started
2019-11-20 15:35:32.921 CET [1480050] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 15:35:32.931 CET [1480051] LOG: logical replication apply worker for subscription "l5_ledger_dana" has started
2019-11-20 15:35:32.942 CET [1480052] LOG: logical replication apply worker for subscription "l5_news" has started

2019-11-20 23:22:50.315 CET [1480050] LOG: data stream from publisher has ended
2019-11-20 23:22:50.315 CET [1480050] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí (= no COPY in progress)
2019-11-20 23:22:50.318 CET [178527] LOG: background worker "logical replication worker" (PID 1480050) exited with exit code 1
2019-11-20 23:22:50.320 CET [1480051] LOG: data stream from publisher has ended
2019-11-20 23:22:50.320 CET [1480051] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.321 CET [178527] LOG: background worker "logical replication worker" (PID 1480051) exited with exit code 1
2019-11-20 23:22:50.321 CET [2195905] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 23:22:50.325 CET [1480052] LOG: data stream from publisher has ended
2019-11-20 23:22:50.325 CET [1480052] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.326 CET [178527] LOG: background worker "logical replication worker" (PID 1480052) exited with exit code 1
2019-11-20 23:22:50.395 CET [2195905] ERROR: could not connect to the publisher: FATAL: the database system is shutting down
FATAL: the database system is shutting down
2019-11-20 23:22:50.396 CET [178527] LOG: background worker "logical replication worker" (PID 2195905) exited with exit code 1
2019-11-20 23:22:55.401 CET [2195919] LOG: logical replication apply worker for subscription "l5_hometv" has started

nothing more in the log, and then the crash:

Nov 20 23:28:47 core kernel: postgres[2195919]: segfault at 562d3bc56291 ip 00007faa0a760917 sp 00007ffd1f93f0c8 error 4 in libc-2.30.so (deleted)[7faa0a6dd000+14d000]
2019-11-20 23:28:52.488 CET [178527] LOG: background worker "logical replication worker" (PID 2195919) was terminated by signal 11: Neoprávněný přístup do pam>
2019-11-20 23:28:52.488 CET [178527] LOG: terminating any other active server processes
2019-11-20 23:28:52.488 CET [2195470] WARNING: terminating connection because of crash of another server process
2019-11-20 23:28:52.488 CET [2195470] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another >

....

then it restarted:

2019-11-20 23:28:52.505 CET [178527] LOG: all server processes terminated; reinitializing
2019-11-20 23:28:52.720 CET [2279954] LOG: database system was interrupted; last known up at 2019-11-20 23:27:33 CET
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 1 to 0/3BBF52C8
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 2 to 2/BBCE38C0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 3 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 4 to 2/BBD75448
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 5 to 0/55503B08
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 6 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 7 to 0/5561CE80
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 8 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 9 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 10 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: database system was not properly shut down; automatic recovery in progress

it ran redo once:

2019-11-20 23:28:54.463 CET [2279954] LOG: redo starts at 8/6D794050
2019-11-20 23:28:54.464 CET [2279954] LOG: invalid record length at 8/6D795370: wanted 24, got 0
2019-11-20 23:28:54.464 CET [2279954] LOG: redo done at 8/6D795338

and then went into an endless crash/restart loop.
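
(If it helps, a backtrace of the crashing apply worker could probably be pulled
out roughly like this, assuming systemd-coredump is enabled and debug symbols
for the postgres package are installed:

  $ coredumpctl list postgres
  $ coredumpctl gdb <pid-of-crashed-worker>
  (gdb) bt full
)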

> > So maybe something's wrong with the replica database (maybe because the
> > connection got killed by the walsender at an unfortunate time), rather
> > than the original database, because I can replicate the original DB
> > afresh into a new copy just fine and other databases continue
> > replicating just fine if I disable the crashing subscription.
> >
>
> Possibly, but what would be the damaged bit? The only thing I can think
> of is the replication slot info (i.e. snapshot), and I know there were
> some timing issues in the serialization.
>
> How far is the change from the restart point of the slot (visible in
> pg_replication_slots)? If there are many changes since then, that'd mean
> the corrupted snapshot is unlikely.

It's the same; for the l5_hometv slot, restart_lsn and confirmed_flush_lsn are identical:

postgres=# select * from pg_replication_slots;

 slot_name  | plugin   | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 l5_news    | pgoutput | logical   | 16386  | news     | f         | f      |            |      | 56463        | 2/BBF679B8  | 2/BBF67B78
 l5_hometv  | pgoutput | logical   | 19030  | hometv   | f         | f      |            |      | 56395        | 2/BBD86E68  | 2/BBD86E68
 l5_hometv2 | pgoutput | logical   | 19030  | hometv   | f         | f      |            |      | 56463        | 2/BBF67B40  | 2/BBF67B78
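
(In case a byte distance is more useful than eyeballing the LSNs, something
like this on the primary should give it directly:

  SELECT slot_name,
         pg_wal_lsn_diff(confirmed_flush_lsn, restart_lsn) AS bytes_between
    FROM pg_replication_slots;
)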

> There's a lot of moving parts in this - you're replicating between major
> versions, and from ARM to x86. All of that should work, of course, but
> maybe there's a bug somewhere. So it might take time to investigate and
> fix. Thanks for your patience ;-)

Yeah, I realize that. :) Thanks for your help.

One missing piece is the exact content of the outstanding output from the
pgoutput plugin that the replica crashes on and fails to apply. Are there any
tools for inspecting the binary output of the pgoutput plugin? Maybe that could
provide a clue.
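
(The closest thing I can think of is peeking at the raw stream on the primary
without consuming it, e.g. something along these lines, with the placeholder
replaced by the real publication name:

  SELECT lsn, xid, octet_length(data) AS len,
         substring(data from 1 for 64) AS head
    FROM pg_logical_slot_peek_binary_changes(
           'l5_hometv', NULL, 10,
           'proto_version', '1',
           'publication_names', '<publication-name>');

or pg_recvlogical on the same slot, though neither of those actually decodes
the pgoutput messages.)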

regards,
o.

> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
