Re: BUG #16125: Crash of PostgreSQL's wal sender during logical replication

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: andrey(dot)salnikov(at)dataegret(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16125: Crash of PostgreSQL's wal sender during logical replication
Date: 2019-11-18 22:24:16
Message-ID: 20191118222416.dkn5cdmbxmtcemaf@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2019-11-18 21:58:16 +0100, Tomas Vondra wrote:
> and the ReorderBufferToastReplace does this:
>
> newtup = change->data.tp.newtuple;
>
> heap_deform_tuple(&newtup->tuple, desc, attrs, isnull);
>
> but that fails, because the tuple pointer happens to be 0x8, which is
> clearly bogus. Not sure where that comes from, I don't recognize that as
> a typical patter.

It indicates that change->data.tp.newtuple is NULL,
afaict. newtup->tuple boils down to
((char *) newtup->tuple) + offsetof(ReorderBufferTupleBuf, tuple)
and offsetof(ReorderBufferTupleBuf, tuple) is 0x8.

> Can you create a core dump (see [1]), and print 'change' and 'txn' in
> frame #2? I wonder if some the other fields are bogus too (but it can't
> be entirely true ...), and if the transaction got serialized.

Please print change and *change, both, please.

I suspect what's happening is that somehow a change that shouldn't have
toast changes - e.g. a DELETE - somehow has toast changes. Which then
triggers a failure in ReorderBufferToastReplace(), which expects
newtuple to be valid.

It's probably worthwhile to add an elog(ERROR) check for this, even if
this does not turn out to be the case.

> > This behaviour does not depends on defined data in tables, because we see it
> > in different database with different sets of tables in publications.
>
> I'm not sure I really believe that. Surely there has to be something
> special about your schema, or possibly access patter that triggers this
> bug in your environment and not elsewhere.

Yea. Are there any C triggers present? Any unusual extensions? Users of
the transaction hook, for example?

> > Looks like a real issue in logical replication.
> > I will happy to provide an additional information about that issue, but i
> > should know what else to need to collect for helping to solve this
> > problem.
> >
>
> Well, if you can create a reproducer, that'd be the best option, because
> then we can investigate locally instead of the ping-ping here.
>
> But if that's not possible, let's start with the schema and the
> additional information from the core file.
>
> I'd also like to see the contents of the WAL, particularly for the XID
> triggering this issue. Please run pg_waldump and see how much data is
> there for XID 1667601527. It does commit at 25EE/D6DE6EB8, not sure
> where it starts. It may have subtransactions, so don't do just grep.

Yea, that'd be helpful.

Greetings,

Andres Freund

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Adam Scott 2019-11-18 22:51:25 Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
Previous Message Elvis Pranskevichus 2019-11-18 21:37:08 Re: BUG #16121: 12 regression: Volatile function in target list subquery behave as stable