Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 23:47:01
Message-ID: 20191121234701.2wtyc77zd5ux3myf@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Nov 21, 2019 at 05:37:25PM -0500, Tom Lane wrote:
>Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> FWIW my hunch is the bug is somewhere in this chunk of code from
>> apply_heap_update:
>
>> oldctx = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
>> ExecCopySlot(remoteslot, localslot);
>> slot_modify_cstrings(remoteslot, rel, newtup.values, newtup.changed);
>> MemoryContextSwitchTo(oldctx);
>
>My first reaction to looking at this was that ExecCopySlot() could
>be dropped entirely, because the first thing that slot_modify_cstrings
>does is
>
> slot_getallattrs(slot);
> ExecClearTuple(slot);
>
>That slot_getallattrs call seems 100% useless as well, because
>surely ExecClearTuple is just going to drop all the data.
>

I don't think that's quite true. After the ExecCopySlot call, the
pass-by-ref Datums in remoteslot will point to a tuple attached to
localslot. But it does not pass the tuple 'ownership' to the remoteslot,
i.e. the flag TTS_FLAG_SHOULDFREE won't be set, i.e. the tuple won't be
freed.

>On closer inspection, though, it looks like the author of this
>code is relying on ExecClearTuple to *not* destroy the data,
>because the subsequent loop might overwrite only some of the
>slot's columns, before it does
>
> ExecStoreVirtualTuple(slot);
>
>which supposes that all the columns are valid.
>
>But of course this is 100% broken, because whatever ExecClearTuple
>may have done or not done to the slot's datum/isnull arrays, it
>certainly pfree'd the slot's physical tuple. So any pass-by-ref
>datums are now dangling pointers.
>
>I imagine the only reason this code has gotten past the valgrind
>animals is that we're not testing cases where non-replaced columns
>in the subscriber table are of pass-by-ref types.
>

I haven't checked, but I'd imagine we actually do such tests. I've
however tried to reproduce this, unsuccessfully.

>I draw a direct line between the existence of this bug and the lack
>of commentary in slot_modify_cstrings about what it's doing. If the
>author had troubled to write a comment explaining these machinations,
>he might've noticed the bug, or at least reviewers might have.
>

Not sure. More comments would not hurt, but I don't see any bug yet.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2019-11-21 23:56:47 Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Previous Message mayur 2019-11-21 23:35:09 Re: BUG #16130: planner does not pick unique btree index and goes for seq scan but unsafe hash index works.