Re: BUG #16226: background worker "logical replication worker" (PID <pid>) was terminated by signal 11: Segmentation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: vadim(at)postgrespro(dot)co(dot)il
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16226: background worker "logical replication worker" (PID <pid>) was terminated by signal 11: Segmentation
Date: 2020-01-22 15:28:08
Message-ID: 7344.1579706888@sss.pgh.pa.us
Lists: pgsql-bugs

> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed a column type in 2
> big tables from integer to text:
> ...
> , this of course led to a full rewrite of both tables. We repeated this
> operation on both servers. And after that we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault" and the server goes into recovery mode.

Not sure, but this seems like it might be explained by this recent
bug fix:

Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Branch: master [4d9ceb001] 2019-11-22 11:31:19 -0500
Branch: REL_12_STABLE [a2aa224e0] 2019-11-22 11:31:19 -0500
Branch: REL_11_STABLE [b72a44c51] 2019-11-22 11:31:19 -0500
Branch: REL_10_STABLE [5d3fcb53a] 2019-11-22 11:31:19 -0500

Fix bogus tuple-slot management in logical replication UPDATE handling.

slot_modify_cstrings seriously abused the TupleTableSlot API by relying
on a slot's underlying data to stay valid across ExecClearTuple. Since
this abuse was also quite undocumented, it's little surprise that the
case got broken during the v12 slot rewrites. As reported in bug #16129
from Ondřej Jirman, this could lead to crashes or data corruption when
a logical replication subscriber processes a row update. Problems would
only arise if the subscriber's table contained columns of pass-by-ref
types that were not being copied from the publisher.

Fix by explicitly copying the datum/isnull arrays from the source slot
that the old row was in already. This ends up being about the same
thing that happened pre-v12, but hopefully in a less opaque and
fragile way.

We might've caught the problem sooner if there were any test cases
dealing with updates involving non-replicated or dropped columns.
Now there are.

Back-patch to v10 where this code came in. Even though the failure
does not manifest before v12, IMO this code is too fragile to leave
as-is. In any case we certainly want the additional test coverage.

Patch by me; thanks to Tomas Vondra for initial investigation.

Discussion: https://postgr.es/m/16129-a0c0f48e71741e5f@postgresql.org
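
For readers unfamiliar with the failure mode, the idea behind the fix in the commit message above can be pictured with a toy model. This is only a sketch in Python, not PostgreSQL code: `ToySlot`, `modify_from_strings`, and `colmap` are hypothetical names invented for illustration, and an in-place list clear stands in for a slot's storage becoming invalid when it is cleared. The point it tries to show is that the old row's value and null arrays are copied explicitly before anything is cleared, so columns that are not replicated keep their old local values.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ToySlot:
    """Hypothetical stand-in for a tuple slot: per-column values and null flags."""
    values: List[Any] = field(default_factory=list)
    isnull: List[bool] = field(default_factory=list)

    def clear(self) -> None:
        # Empty the arrays in place, so any alias to them sees the loss.
        # This loosely mimics storage that is no longer valid after a clear.
        self.values.clear()
        self.isnull.clear()


def modify_from_strings(dest: ToySlot, src: ToySlot,
                        remote_text: List[Optional[str]],
                        colmap: List[Optional[int]]) -> None:
    """Build the updated row in dest from the old row held in src.

    colmap[i] is the index into remote_text for local column i, or None
    when that local column is not replicated from the publisher.
    """
    # Copy the old row's arrays first, so nothing below depends on the
    # original storage surviving a clear of either slot.
    new_values = list(src.values)
    new_isnull = list(src.isnull)

    dest.clear()

    for i, remote_idx in enumerate(colmap):
        if remote_idx is None:
            continue  # non-replicated column: keep the old local value
        text = remote_text[remote_idx]
        new_values[i] = text
        new_isnull[i] = text is None

    dest.values = new_values
    dest.isnull = new_isnull


# The middle column exists only on the subscriber and is not replicated.
old = ToySlot(values=[1, "local-only", "old"], isnull=[False, False, False])
new = ToySlot()
modify_from_strings(new, old, remote_text=["42", "new"], colmap=[0, None, 1])
print(new.values)  # ['42', 'local-only', 'new']
```

Because the copy happens before the clear, the sketch also behaves the same when `dest` and `src` are the same object, which is the kind of hidden dependency the commit message describes removing.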

regards, tom lane
