Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 12:59:30
Message-ID: 20191121125930.or5fhpi2yjgkmjye@development
Lists: pgsql-bugs

On Thu, Nov 21, 2019 at 12:53:26PM +0100, Ondřej Jirman wrote:
>Hello,
>
>On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
>> > The following bug has been logged on the website:
>> >
>> > Bug reference: 16129
>> > Logged by: Ondrej Jirman
>> > Email address: ienieghapheoghaiwida(at)xff(dot)cz
>> > PostgreSQL version: 12.1
>> > Operating system: Arch Linux
>> > Description:
>> >
>> > Hello,
>> >
>> > I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via pg_dumpall
>> > method and after a while I started getting a segfault in the logical replication
>> > worker.
>> >
>> > My setup is fairly vanilla, non-default options:
>> >
>> > shared_buffers = 256MB
>> > work_mem = 512MB
>> > temp_buffers = 64MB
>> > maintenance_work_mem = 4GB
>> > effective_cache_size = 16GB
>> > max_logical_replication_workers = 30
>> > max_replication_slots = 30
>> > max_worker_processes = 30
>> > wal_level = logical
>> >
>> > I have several databases that I subscribe to from this database cluster
>> > using logical replication.
>> >
>> > Replication of one of my databases (running on an ARMv7 machine) started
>> > segfaulting on the subscriber side (x86_64) like this:
>> >
>> > #0 0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
>> > /usr/lib/libc.so.6
>> > #1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
>> > __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>> > #2 tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
>> > #3 0x000055d033e94d32 in ExecFetchSlotHeapTuple
>> > (slot=slot(at)entry=0x55d0356da3b8, materialize=materialize(at)entry=true,
>> > shouldFree=shouldFree(at)entry=0x7fff0e7cf387) at execTuples.c:1624
>
>I forgot to add that publisher is still PostgreSQL 11.5.
>
>> Hmmm, so it's failing on this memcpy() in tts_virtual_materialize:
>>
>>     else
>>     {
>>         Size        data_length = 0;
>>
>>         data = (char *) att_align_nominal(data, att->attalign);
>>         data_length = att_addlength_datum(data_length, att->attlen, val);
>>
>>         memcpy(data, DatumGetPointer(val), data_length);
>
>This is a bytea column in one of the tables on the publisher.
>
>So this looks like it's segfaulting while trying to copy too many bytes
>(data_length determined as 620,701,425 bytes):
>
>#1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>
>But the maximum length of any bytea value in the publisher database is <200kB.
>

OK. So the public.videos.cover_image column, I suppose?

Anyway, this likely means the pointer is bogus in some way. Maybe it's
pointing to already-freed memory, or something like that.
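
For reference, a minimal sketch of what att_addlength_datum() reduces to
here (based on the att_addlength_pointer macro in
src/include/access/tupmacs.h; the variable names match the quoted snippet):

    /*
     * For a bytea column attlen is -1, so the length is read from the
     * varlena header at the very start of the pointed-to datum.  If
     * tts_values[natt] points at freed or overwritten memory, that
     * "header" is garbage and memcpy() is handed a huge length.
     */
    if (att->attlen > 0)
        data_length += att->attlen;                             /* fixed width */
    else if (att->attlen == -1)
        data_length += VARSIZE_ANY(DatumGetPointer(val));       /* varlena, e.g. bytea */
    else
        data_length += strlen((char *) DatumGetPointer(val)) + 1;  /* attlen == -2, cstring */

So a data_length of 620701425 for a value that should be under 200kB is
consistent with the datum pointer (or the memory behind it) no longer
being valid by the time the slot gets materialized.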

>>
>>         slot->tts_values[natt] = PointerGetDatum(data);
>>         data += data_length;
>>     }
>>
>> The question is, which of the pointers is bogus. You seem to already
>> have a core file, so can you inspect the variables in frame #2? I think
>> especially
>>
>> p *slot
>> p natt
>> p val
>> p *att
>>
>> would be interesting to see.
>
>(gdb) p *slot
>$1 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 8, tts_ops = 0x558149ff4da0 <TTSOpsVirtual>, tts_tupleDescriptor = 0x7fcca2ea7548, tts_values = 0x55814adfbc10,
> tts_isnull = 0x55814adfbc50, tts_mcxt = 0x55814adfb6e0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
>(gdb) p natt
>$2 = 2
>(gdb) p val
>$3 = <optimized out>
>(gdb) p slot->tts_values[nat]
>No symbol "nat" in current context.
>(gdb) p slot->tts_values[natt]
>$4 = 94013795319824
>(gdb) p *slot->tts_values[natt]
>$5 = -1812161596
>(gdb) p *att
>$6 = {attrelid = 55240, attname = {data = "cover_image", '\000' <repeats 52 times>}, atttypid = 17, attstattarget = -1, attlen = -1, attnum = 3, attndims = 0, attcacheoff = -1, atttypmod = -1,
> attbyval = false, attstorage = 120 'x', attalign = 105 'i', attnotnull = true, atthasdef = false, atthasmissing = false, attidentity = 0 '\000', attgenerated = 0 '\000', attisdropped = false,
> attislocal = true, attinhcount = 0, attcollation = 0}
>

OK, nothing obviously bogus here, I think. It's not entirely clear to me
why this works

(gdb) p *slot->tts_values[natt]
$5 = -1812161596

because tts_values[natt] is not a pointer, I think. So I'm not sure how
to interpret the dereference. It would be interesting to see the first
couple of bytes. I think something like

(gdb) p ((char*)slot->tts_values[natt])[$n]

for $n between 0 and 3 should do the trick.
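
To interpret what that prints: the leading bytes are the varlena header
that VARSIZE_ANY() decodes into data_length. A rough sketch, assuming a
little-endian build (x86_64, as in the report) and mirroring the
VARATT_IS_1B / VARSIZE_1B / VARSIZE_4B macros in src/include/postgres.h:

    /* p stands for the datum pointer printed above */
    unsigned char *p = (unsigned char *) DatumGetPointer(slot->tts_values[natt]);

    if (p[0] == 0x01)
        elog(INFO, "external TOAST pointer");
    else if (p[0] & 0x01)
        elog(INFO, "1-byte header, total size %u",
             (unsigned) ((p[0] >> 1) & 0x7F));
    else
        elog(INFO, "4-byte header, total size %u",
             (*(uint32 *) p >> 2) & 0x3FFFFFFF);

A sane in-line bytea under 200kB has to decode to a 1-byte header (at most
126 bytes of payload) or a 4-byte header well under a megabyte, so anything
wildly larger means those bytes aren't a valid bytea value at all.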

>> Also, how does the replicated schema look like? Can we see the table
>> definitions?
>
> ...
>CREATE TABLE public.videos (
> id integer NOT NULL,
> title text NOT NULL,
> cover_image bytea NOT NULL,
> metadata jsonb NOT NULL,
> category integer NOT NULL,
> published date NOT NULL,
> added timestamp without time zone DEFAULT now() NOT NULL,
> played boolean DEFAULT false NOT NULL
>);

OK, so this is the problematic table. Any idea what operation triggers
the issue? Having a small reproducer (say, a single-row insert or so)
would be very helpful.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
