From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: andrey(dot)salnikov(at)dataegret(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16125: Crash of PostgreSQL's wal sender during logical replication
Date: 2019-11-18 20:58:16
Message-ID: 20191118205816.5wne6hxedrm7zd5p@development
Lists: pgsql-bugs

On Mon, Nov 18, 2019 at 06:41:43PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference: 16125
>Logged by: Andrey Salnikov
>Email address: andrey(dot)salnikov(at)dataegret(dot)com
>PostgreSQL version: 10.11
>Operating system: Ubuntu 18.04.2 LTS
>Description:
>
>Hi,
>PostgreSQL's wal sender has started to crash continuously during logical
>replication. After each crash, the restarted sender tries to decode the same
>part of the WAL and crashes again and again.
>

Thanks for the report. Unfortunately this does not tell us much, except
that it crashed :-(

>Log contains next lines:
>2019-11-16 03:01:39.512 UTC 22235 @ from [vxid: txid:0] [] LOG: server
>process (PID 26972) was terminated by signal 11: Segmentation fault
>--
> ...
>
>[idle] DETAIL: There are no running transactions.
>2019-11-16 03:03:18.949 UTC 22235 @ from [vxid: txid:0] [] LOG: server
>process (PID 10112) was terminated by signal 11: Segmentation fault
>

This is more interesting:

>Core dump contains:
>Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
>from
>/usr/lib/debug/.build-id/92/b3b790a5135245f2cbb2a0a2f48fc7ff9f689a.debug...done.
>done.
>[New LWP 13678]
>[Thread debugging using libthread_db enabled]
>Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>Core was generated by `postgres: 10/main: wal sender process replication
>1.1.1.1(43224) idle '.
>Program terminated with signal SIGSEGV, Segmentation fault.
>#0 heap_deform_tuple (tuple=tuple(at)entry=0x8,
>tupleDesc=tupleDesc(at)entry=0x555713b2e638,
>values=values(at)entry=0x555713b7ca48, isnull=isnull(at)entry=0x555713b7cf98 "")
>at ./build/../src/backend/access/common/heaptuple.c:936
>936 ./build/../src/backend/access/common/heaptuple.c: No such file or
>directory.
>(gdb) bt
>#0 heap_deform_tuple (tuple=tuple(at)entry=0x8,
>tupleDesc=tupleDesc(at)entry=0x555713b2e638,
>values=values(at)entry=0x555713b7ca48, isnull=isnull(at)entry=0x555713b7cf98 "")
>at ./build/../src/backend/access/common/heaptuple.c:936
>#1 0x00005557127c918b in ReorderBufferToastReplace (rb=0x555713b7b788,
>txn=0x555713b982e8, relation=0x7f4b3fafa248, relation=0x7f4b3fafa248,
>change=0x555713b9a108)
> at ./build/../src/backend/replication/logical/reorderbuffer.c:2985
>#2 ReorderBufferCommit (rb=0x555713b7b788, xid=xid(at)entry=1667601527,
>commit_lsn=commit_lsn(at)entry=41707737345720,
>end_lsn=end_lsn(at)entry=41707737345792,
>commit_time=commit_time(at)entry=627188499326201,
> origin_id=origin_id(at)entry=0, origin_lsn=0) at
>./build/../src/backend/replication/logical/reorderbuffer.c:1585

OK, so the issue seems to be related to handling TOASTed values, i.e. we
do this:

    /* user-triggered change */
    if (!IsToastRelation(relation))
    {
        ReorderBufferToastReplace(rb, txn, relation, change);
        rb->apply_change(rb, txn, relation, change);

        /*
         * Only clear reassembled toast chunks if we're sure
         * they're not required anymore. The creator of the
         * tuple tells us.
         */
        if (change->data.tp.clear_toast_afterwards)
            ReorderBufferToastReset(rb, txn);
    }

and ReorderBufferToastReplace does this:

    newtup = change->data.tp.newtuple;

    heap_deform_tuple(&newtup->tuple, desc, attrs, isnull);

but that fails, because the tuple pointer happens to be 0x8, which is
clearly bogus. I'm not sure where that comes from; I don't recognize it
as a typical pattern.

Can you create a core dump (see [1]), and print 'change' and 'txn' in
frame #2? I wonder if some of the other fields are bogus too (they can't
all be, though ...), and whether the transaction got serialized.

You may need to install debug symbols first, BTW.
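
For reference, a minimal gdb session along those lines might look roughly
like this (the core file path is just a placeholder, use wherever your
system drops core files; frame #2 is the ReorderBufferCommit frame from
the backtrace above):

    $ gdb /usr/lib/postgresql/10/bin/postgres /path/to/core
    (gdb) bt
    (gdb) frame 2
    (gdb) print *txn
    (gdb) print *change
    (gdb) print change->data.tp.newtuple

The change->data.tp.newtuple pointer is particularly interesting, given
the bogus tuple pointer in frame #0.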

>
>This behaviour does not depend on the data defined in the tables, because we
>see it in different databases with different sets of tables in the publications.

I'm not sure I really believe that. Surely there has to be something
special about your schema, or possibly the access pattern, that triggers
this bug in your environment and not elsewhere.

Do you see it on different PostgreSQL instances? Can you share the
schema definition?
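
If that's easiest, a plain schema-only dump of one of the affected
databases would do, e.g. (the database name is just a placeholder):

    pg_dump --schema-only --no-owner yourdb > schema.sql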

>Looks like a real issue in logical replication.
>I will be happy to provide additional information about this issue, but I
>need to know what else to collect to help solve this problem.
>

Well, if you can create a reproducer, that'd be the best option, because
then we can investigate locally instead of the ping-pong here.

But if that's not possible, let's start with the schema and the
additional information from the core file.

I'd also like to see the contents of the WAL, particularly for the XID
triggering this issue. Please run pg_waldump and see how much data there
is for XID 1667601527. It commits at 25EE/D6DE6EB8; I'm not sure where it
starts. It may have subtransactions, so don't rely on a simple grep.
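
Something along these lines should work (the pg_wal path is a placeholder,
and the start LSN is a guess; move it further back if the transaction
turns out to start earlier; 25EE/D6DE6EB8 is the commit LSN from the
backtrace):

    pg_waldump -p /var/lib/postgresql/10/main/pg_wal \
        -s 25EE/D6000000 -e 25EE/D6DE7000 > wal-dump.txt

That dumps every record in the range, so records of any subtransactions
are included too, even though they carry their own XIDs.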

regards

[1] https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
