Re: BUG #16226: background worker "logical replication worker" (PID <pid>) was terminated by signal 11: Segmentation

From: Vadim Yatsenko <vadim(at)postgrespro(dot)co(dot)il>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16226: background worker "logical replication worker" (PID <pid>) was terminated by signal 11: Segmentation
Date: 2020-01-23 09:03:02
Message-ID: CAJTwZ8w-o6QwL-4v=-jjCWDtB2UDA1KY05GCRYQ+6fbvR2ErZA@mail.gmail.com
Lists: pgsql-bugs

Tom,

Thank you! We'll wait for the patch before updating our servers.

Best Regards,
Vadim Yatsenko

Wed, 22 Jan 2020, 18:28 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> > We have 2 PostgreSQL servers with logical replication between Postgres
> > 11.6 (Primary) and 12.1 (Logical). Some time ago, we changed the column
> > type in 2 big tables from integer to text:
> > ...
> > This of course led to a full rewrite of both tables. We repeated this
> > operation on both servers. After that we started to get errors like
> > "background worker "logical replication worker" (PID <pid>) was terminated
> > by signal 11: Segmentation fault", and the server goes into recovery mode.
>
> Not sure, but this seems like it might be explained by this recent
> bug fix:
>
>
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Branch: master [4d9ceb001] 2019-11-22 11:31:19 -0500
> Branch: REL_12_STABLE [a2aa224e0] 2019-11-22 11:31:19 -0500
> Branch: REL_11_STABLE [b72a44c51] 2019-11-22 11:31:19 -0500
> Branch: REL_10_STABLE [5d3fcb53a] 2019-11-22 11:31:19 -0500
>
> Fix bogus tuple-slot management in logical replication UPDATE handling.
>
> slot_modify_cstrings seriously abused the TupleTableSlot API by relying
> on a slot's underlying data to stay valid across ExecClearTuple. Since
> this abuse was also quite undocumented, it's little surprise that the
> case got broken during the v12 slot rewrites. As reported in bug #16129
> from Ondřej Jirman, this could lead to crashes or data corruption when
> a logical replication subscriber processes a row update. Problems would
> only arise if the subscriber's table contained columns of pass-by-ref
> types that were not being copied from the publisher.
>
> Fix by explicitly copying the datum/isnull arrays from the source slot
> that the old row was in already. This ends up being about the same
> thing that happened pre-v12, but hopefully in a less opaque and
> fragile way.
>
> We might've caught the problem sooner if there were any test cases
> dealing with updates involving non-replicated or dropped columns.
> Now there are.
>
> Back-patch to v10 where this code came in. Even though the failure
> does not manifest before v12, IMO this code is too fragile to leave
> as-is. In any case we certainly want the additional test coverage.
>
> Patch by me; thanks to Tomas Vondra for initial investigation.
>
> Discussion: https://postgr.es/m/16129-a0c0f48e71741e5f@postgresql.org
>
> regards, tom lane
>
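For readers following the quoted commit message, the failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: ToySlot, update_fragile and update_with_copy are hypothetical names invented for this illustration, not PostgreSQL code, and the model shows lost column values where the real bug produced crashes or data corruption. It contrasts code that assumes a slot's columns survive a clear with code that first copies the value/null arrays from the slot holding the old row.

```python
class ToySlot:
    """Hypothetical stand-in for a tuple slot; not PostgreSQL's TupleTableSlot."""

    def __init__(self, values):
        self.values = list(values)
        self.isnull = [v is None for v in values]

    def clear(self):
        # Model the slot releasing its row: every column becomes invalid.
        for i in range(len(self.values)):
            self.values[i] = None
            self.isnull[i] = True


def update_fragile(slot, replicated):
    """Clear the slot, then assume untouched columns still hold the old row."""
    slot.clear()
    for col, val in replicated.items():
        slot.values[col] = val
        slot.isnull[col] = val is None
    return slot.values


def update_with_copy(dst, src, replicated):
    """Copy the old row's arrays from the source slot, then overwrite
    only the columns that arrived from the publisher."""
    dst.clear()
    dst.values = list(src.values)
    dst.isnull = list(src.isnull)
    for col, val in replicated.items():
        dst.values[col] = val
        dst.isnull[col] = val is None
    return dst.values


old_row = [1, "local-only text", "old"]
incoming = {2: "new"}  # assumption: only column 2 is replicated

fragile_result = update_fragile(ToySlot(old_row), incoming)
copied_result = update_with_copy(ToySlot([None] * 3), ToySlot(old_row), incoming)
```

In this model the fragile path loses the two non-replicated columns, while the copying path keeps them and changes only the replicated one.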
