Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 13:12:02
Message-ID: 20191121131202.xhi5hxvdjjfvbmop@core.my.home

On Thu, Nov 21, 2019 at 01:59:30PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 12:53:26PM +0100, Ondřej Jirman wrote:
> > Hello,
> >
> > On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
> > > On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
> > > > The following bug has been logged on the website:
> > > >
> > > > Bug reference: 16129
> > > > Logged by: Ondrej Jirman
> > > > Email address: ienieghapheoghaiwida(at)xff(dot)cz
> > > > PostgreSQL version: 12.1
> > > > Operating system: Arch Linux
> > > > Description:
> > > >
> > > > Hello,
> > > >
> > > > I've upgraded my main PostgreSQL cluster from 11.5 to 12.1 via the
> > > > pg_dumpall method, and after a while I started getting segfaults in a
> > > > logical replication worker.
> > > >
> > > > My setup is fairly vanilla; the non-default options are:
> > > >
> > > > shared_buffers = 256MB
> > > > work_mem = 512MB
> > > > temp_buffers = 64MB
> > > > maintenance_work_mem = 4GB
> > > > effective_cache_size = 16GB
> > > > max_logical_replication_workers = 30
> > > > max_replication_slots = 30
> > > > max_worker_processes = 30
> > > > wal_level = logical
> > > >
> > > > I have several databases that I subscribe to from this database cluster
> > > > using logical replication.
> > > >
> > > > Replication of one of my databases (running on an ARMv7 machine) started
> > > > segfaulting on the subscriber side (x86_64) like this:
> > > >
> > > > #0  0x00007fc259739917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
> > > > #1  0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
> > > > #2  tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
> > > > #3  0x000055d033e94d32 in ExecFetchSlotHeapTuple (slot=slot(at)entry=0x55d0356da3b8, materialize=materialize(at)entry=true, shouldFree=shouldFree(at)entry=0x7fff0e7cf387) at execTuples.c:1624
> >
> > I forgot to add that the publisher is still PostgreSQL 11.5.
> >
> > > Hmmm, so it's failing on this memcpy() in tts_virtual_materialize:
> > >
> > > else
> > > {
> > > Size data_length = 0;
> > >
> > > data = (char *) att_align_nominal(data, att->attalign);
> > > data_length = att_addlength_datum(data_length, att->attlen, val);
> > >
> > > memcpy(data, DatumGetPointer(val), data_length);
> >
> > This is a bytea column in one of the tables on the publisher.
> >
> > So this looks like it's segfaulting while trying to copy too many bytes
> > (data_length was determined as 620,701,425 bytes):
> >
> > #1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>, __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
> >
> > But the maximum length of any bytea value in the publisher database is less than 200 kB.
> >
>
> OK. So the public.videos.cover_image column, I suppose?
>
> Anyway, this likely means the pointer is bogus in some way. Maybe it's
> pointing to already freed memory, or something like that.
>
> > >
> > > slot->tts_values[natt] = PointerGetDatum(data);
> > > data += data_length;
> > > }
> > >
> > > The question is, which of the pointers is bogus. You seem to already
> > > have a core file, so can you inspect the variables in frame #2? I think
> > > especially
> > >
> > > p *slot
> > > p natt
> > > p val
> > > p *att
> > >
> > > would be interesting to see.
> >
> > (gdb) p *slot
> > $1 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 8, tts_ops = 0x558149ff4da0 <TTSOpsVirtual>, tts_tupleDescriptor = 0x7fcca2ea7548, tts_values = 0x55814adfbc10,
> > tts_isnull = 0x55814adfbc50, tts_mcxt = 0x55814adfb6e0, tts_tid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
> > (gdb) p natt
> > $2 = 2
> > (gdb) p val
> > $3 = <optimized out>
> > (gdb) p slot->tts_values[nat]
> > No symbol "nat" in current context.
> > (gdb) p slot->tts_values[natt]
> > $4 = 94013795319824
> > (gdb) p *slot->tts_values[natt]
> > $5 = -1812161596
> > (gdb) p *att
> > $6 = {attrelid = 55240, attname = {data = "cover_image", '\000' <repeats 52 times>}, atttypid = 17, attstattarget = -1, attlen = -1, attnum = 3, attndims = 0, attcacheoff = -1, atttypmod = -1,
> > attbyval = false, attstorage = 120 'x', attalign = 105 'i', attnotnull = true, atthasdef = false, atthasmissing = false, attidentity = 0 '\000', attgenerated = 0 '\000', attisdropped = false,
> > attislocal = true, attinhcount = 0, attcollation = 0}
> >
>
> OK, nothing obviously bogus here, I think. It's not entirely clear to me
> why this works
>
> (gdb) p *slot->tts_values[natt]
> $5 = -1812161596
>
> because tts_values[natt] is not a pointer, I think. So I'm not sure how
> to interpret the dereference. It would be interesting to see the first
> couple of bytes. I think something like

It's probably nothing, the value there just looked big enough to be a pointer,
so I dereferenced it in gdb.

> (gdb) p ((char*)slot->tts_values[natt])[$n]
>
> for $n between 0 and 3 should do the trick.

(gdb) p ((char*)slot->tts_values[natt])[0]
$7 = -60 '\304'
(gdb) p ((char*)slot->tts_values[natt])[1]
$8 = -101 '\233'
(gdb) p ((char*)slot->tts_values[natt])[2]
$9 = -4 '\374'
(gdb) p ((char*)slot->tts_values[natt])[3]
$10 = -109 '\223'
(gdb) p ((char*)slot->tts_values[natt])[4]
$11 = 0 '\000'
(gdb) p ((char*)slot->tts_values[natt])[5]
$12 = 0 '\000'
(gdb) p ((char*)slot->tts_values[natt])[6]
$13 = -8 '\370'
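
Interestingly, if I interpret those first four bytes as a little-endian 4-byte
varlena header (which is what att_addlength_datum() ends up reading for an
attlen = -1 column like bytea), I get exactly the bogus length from frame #1.
A minimal standalone sketch that just redoes the arithmetic, assuming
PostgreSQL's standard little-endian varlena layout:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* The first four bytes at the bogus pointer, as printed by gdb above. */
	unsigned char hdr[4] = {0xC4, 0x9B, 0xFC, 0x93};

	/*
	 * On little-endian machines, a 4-byte varlena header stores the total
	 * datum length (header included) shifted left by two bits; low bits
	 * of 00 mark an uncompressed 4-byte-header datum.
	 */
	uint32_t raw = (uint32_t) hdr[0]
	             | (uint32_t) hdr[1] << 8
	             | (uint32_t) hdr[2] << 16
	             | (uint32_t) hdr[3] << 24;

	printf("raw header = 0x%08x\n", (unsigned) raw);    /* 0x93FC9BC4 */
	printf("low bits   = %u\n", (unsigned) (raw & 3));  /* 0 = 4B uncompressed */
	printf("VARSIZE    = %u\n", (unsigned) (raw >> 2)); /* 620701425 */
	return 0;
}

That prints 620701425, which matches __len=620701425 in the memcpy frame
exactly, so data_length is being computed from whatever happens to sit at that
address; the slot apparently holds a pointer to garbage rather than to a valid
bytea value.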

> > > Also, what does the replicated schema look like? Can we see the table
> > > definitions?
> >
> > ...
> > CREATE TABLE public.videos (
> > id integer NOT NULL,
> > title text NOT NULL,
> > cover_image bytea NOT NULL,
> > metadata jsonb NOT NULL,
> > category integer NOT NULL,
> > published date NOT NULL,
> > added timestamp without time zone DEFAULT now() NOT NULL,
> > played boolean DEFAULT false NOT NULL
> > );
>
> OK, so this is the problematic table. Any idea what operation triggers
> the issue? Having a small reproducer (say, a single-row insert or so)
> would be very helpful.

Can something be extracted from the logical decoding data I sent in the other
e-mail? That seems to contain the failing operation, but I don't know how to
decode it off the top of my head from that binary representation.
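
In case it helps: the pgoutput message layout is described in the docs under
"Logical Replication Message Formats", so if what I captured is the raw
replication stream (ignoring the outer CopyData/XLogData framing), walking an
Insert message should be mechanical. Below is a minimal sketch over a made-up
two-column buffer (the OID just mirrors the subscriber-side attrelid seen in
gdb; it is not real capture data):

#include <stdio.h>
#include <stdint.h>

static uint32_t be32(const unsigned char *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
	       ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

static uint16_t be16(const unsigned char *p)
{
	return (uint16_t) (((uint16_t) p[0] << 8) | p[1]);
}

int main(void)
{
	/*
	 * Insert message: 'I', Int32 relation OID, 'N', then TupleData:
	 * Int16 column count followed by one kind byte per column --
	 * 'n' = NULL, 'u' = unchanged TOAST, 't' = Int32 length + bytes.
	 * Example: OID 55240, two columns: text "42" and a NULL.
	 */
	const unsigned char msg[] = {
		'I', 0x00, 0x00, 0xD7, 0xC8,
		'N', 0x00, 0x02,
		't', 0x00, 0x00, 0x00, 0x02, '4', '2',
		'n'
	};
	const unsigned char *p = msg;

	printf("type=%c relid=%u\n", *p, be32(p + 1));
	p += 5;
	uint16_t ncols = be16(p + 1);
	printf("tuple kind=%c ncols=%u\n", *p, (unsigned) ncols);
	p += 3;

	for (uint16_t i = 0; i < ncols; i++)
	{
		unsigned char kind = *p++;
		if (kind == 't')
		{
			uint32_t len = be32(p);
			printf("col %u: %u bytes of text data\n", (unsigned) i, len);
			p += 4 + len;
		}
		else
			printf("col %u: '%c' (null/unchanged)\n", (unsigned) i, kind);
	}
	return 0;
}

In protocol version 1, all column values come over in text format, so the
bytea column should show up as an escaped text string; that would at least
tell us what length the publisher thinks it sent for cover_image.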

regards,
o.

> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
