Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 16:57:52
Message-ID: 20191121165752.dffge6bh756xlfdg@development

On Thu, Nov 21, 2019 at 05:15:02PM +0100, Ondřej Jirman wrote:
>On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
>>
>> Maybe it has something to do with my upgrade method. I
>> dumped/restored the replica with pg_dumpall, and then just proceeded
>> to enable subscription and refresh publication with (copy_data=false)
>> for all my subscriptions.
>
>OTOH, it may not. There are 2 more databases replicated the same way
>from the same database cluster, and they don't crash the replica
>server, and continue replicating. One of the other databases also
>has bytea columns in some of the tables.
>
>It really just seems related to the machine restart (a regular one)
>that I did on the primary; minutes later the replica crashed, and it has
>kept crashing ever since whenever it connects to the primary for the
>hometv database.
>

Hmmm. A restart of the primary certainly should not cause any such
damage, that'd be a bug too. And it'd be a bit strange for the primary
to send the data correctly and for the replica to crash on it anyway.
How exactly did you restart the primary? What mode - smart/fast/immediate?

>So maybe something's wrong with the replica database (maybe because the
>connection got killed by the walsender at an unfortunate time), rather
>than the original database, because I can replicate the original DB
>afresh into a new copy just fine and other databases continue
>replicating just fine if I disable the crashing subscription.
>

Possibly, but what would be the damaged bit? The only thing I can think
of is the replication slot info (i.e. snapshot), and I know there were
some timing issues in the serialization.

How far is the change that triggers the crash from the restart point of
the slot (restart_lsn, visible in pg_replication_slots)? If there are
many changes between the two, a corrupted snapshot is unlikely.
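
To put a rough number on that distance, here is a minimal sketch. It
assumes you copy restart_lsn from pg_replication_slots and the LSN of the
failing change by hand. The helper names and the sample LSNs are made up
for illustration and do not come from this report. On the server,
pg_wal_lsn_diff() should give the same number.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/16B3748' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_distance(later: str, earlier: str) -> int:
    """Return how many bytes of WAL separate two LSNs (later - earlier)."""
    return parse_lsn(later) - parse_lsn(earlier)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    restart_lsn = "0/16B3748"
    failing_change_lsn = "0/2000000"
    print(lsn_distance(failing_change_lsn, restart_lsn), "bytes of WAL")
```

A small result means the failing change sits close to the slot's restart
point; a large one means a lot of WAL was decoded in between.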

There are a lot of moving parts in this - you're replicating between
major versions, and from ARM to x86. All of that should work, of course,
but maybe there's a bug somewhere. So it might take time to investigate
and fix. Thanks for your patience ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
