Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 15:57:07
Message-ID: 20191121155707.pqkgjphh65o5fapz@core.my.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Nov 21, 2019 at 04:27:32PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 04:07:16PM +0100, Ondřej Jirman wrote:
> > I've also managed to get valgrind output on the segfaulting process:
> >
> > https://megous.com/dl/tmp/valgrind-105524-109020.log
> >
> > Though it probably doesn't show much new.
> >
>
> Is postgres actually compiled with valgrind support? What does pg_config
> say?

I don't see any valgrind related options in configure --help, I repackaged it
just with CFLAGS set to -ggdb -Og.

CONFIGURE = '--prefix=/usr' '--mandir=/usr/share/man' '--datadir=/usr/share/postgresql' '--sysconfdir=/etc' '--with-gssapi' '--with-libxml' '--with-openssl' '--with-perl' '--with-python' '--with-tcl' '--with-pam' '--with-system-tzdata=/usr/share/zoneinfo' '--with-uuid=e2fs' '--with-icu' '--with-systemd' '--with-ldap' '--enable-nls' '--enable-thread-safety' '--disable-rpath' 'PYTHON=/usr/bin/python2' 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug' 'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now' 'CPPFLAGS=-D_FORTIFY_SOURCE=2' 'CXXFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 12.1

> If you're building the server locally (I don't think there are packages
> with valgrind enabled), maybe try adding --enable-cassert option and
>
> CFLAGS"-DRANDOMIZE_ALLOCATED_MEMORY"
>
> which will do more extensive checks at runtime.

Done.

CONFIGURE = '--prefix=/usr' '--mandir=/usr/share/man' '--datadir=/usr/share/postgresql' '--sysconfdir=/etc' '--with-gssapi' '--with-libxml' '--with-openssl' '--with-perl' '--with-python' '--with-tcl' '--with-pam' '--with-system-tzdata=/usr/share/zoneinfo' '--with-uuid=e2fs' '--with-icu' '--with-systemd' '--with-ldap' '--enable-nls' '--enable-thread-safety' '--disable-rpath' '--enable-cassert' 'PYTHON=/usr/bin/python2' 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug -DRANDOMIZE_ALLOCATED_MEMORY' 'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now' 'CPPFLAGS=-D_FORTIFY_SOURCE=2' 'CXXFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Og -ggdb -fdebug-prefix-map=/workspace/external/pgsql/postgresql-pkg/src=/usr/src/debug -DRANDOMIZE_ALLOCATED_MEMORY
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 12.1

The difference now is that _len argument value is not stable accross crashes.

one crash...

#0 0x00007fb90c336917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
#1 0x000055ff2a7d0e64 in memcpy (__len=972662513, __src=<optimized out>, __dest=0x55ff2cc7dce4) at /usr/include/bits/string_fortified.h:34
#2 tts_virtual_materialize (slot=0x55ff2cc7d868) at execTuples.c:235
#3 0x000055ff2a7d2a63 in ExecFetchSlotHeapTuple (slot=slot(at)entry=0x55ff2cc7d868, materialize=materialize(at)entry=true, shouldFree=shouldFree(at)entry=0x7ffc80cfc607) at execTuples.c:1624
#4 0x000055ff2a634809 in heapam_tuple_update (relation=0x7fb8f8338100, otid=0x55ff2cc7e138, slot=0x55ff2cc7d868, cid=0, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffc80cfc680,
lockmode=0x7ffc80cfc67c, update_indexes=0x7ffc80cfc6d7) at heapam_handler.c:325
#5 0x000055ff2a675d55 in table_tuple_update (update_indexes=0x7ffc80cfc6d7, lockmode=0x7ffc80cfc67c, tmfd=0x7ffc80cfc680, wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>,
slot=0x55ff2cc7d868, otid=0x55ff2cc7e138, rel=0x7fb8f8338100) at ../../../../src/include/access/tableam.h:1261
#6 simple_table_tuple_update (rel=rel(at)entry=0x7fb8f8338100, otid=otid(at)entry=0x55ff2cc7e138, slot=slot(at)entry=0x55ff2cc7d868, snapshot=0x0, update_indexes=update_indexes(at)entry=0x7ffc80cfc6d7)
at tableam.c:325
#7 0x000055ff2a7cdf08 in ExecSimpleRelationUpdate (estate=estate(at)entry=0x55ff2ccd02b8, epqstate=epqstate(at)entry=0x7ffc80cfc730, searchslot=searchslot(at)entry=0x55ff2cc7e108,
slot=slot(at)entry=0x55ff2cc7d868) at execReplication.c:493
#8 0x000055ff2a8eb8b2 in apply_handle_update (s=s(at)entry=0x7ffc80d03d00) at worker.c:750
#9 0x000055ff2a8ebe62 in apply_dispatch (s=s(at)entry=0x7ffc80d03d00) at worker.c:968
#10 0x000055ff2a8ec5ba in LogicalRepApplyLoop (last_received=11741462160) at worker.c:1176
#11 0x000055ff2a8ecd1a in ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
#12 0x000055ff2a8b3927 in StartBackgroundWorker () at bgworker.c:834
#13 0x000055ff2a8c2b6b in do_start_bgworker (rw=rw(at)entry=0x55ff2cbbf430) at postmaster.c:5770
#14 0x000055ff2a8c2cf7 in maybe_start_bgworkers () at postmaster.c:5996
#15 0x000055ff2a8c3678 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007fb90c384e2a in select () from /usr/lib/libc.so.6
#18 0x000055ff2a8c40f8 in ServerLoop () at postmaster.c:1668
#19 0x000055ff2a8c54fc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1377
#20 0x000055ff2a81f88a in main (argc=3, argv=0x55ff2cb94260) at main.c:228

and another ...

#0 0x00007f7a1befc917 in __memmove_sse2_unaligned_erms () from /usr/lib/libc.so.6
#1 0x0000559a5439ae64 in memcpy (__len=766338801, __src=<optimized out>, __dest=0x559a55626934) at /usr/include/bits/string_fortified.h:34
#2 tts_virtual_materialize (slot=0x559a556264b8) at execTuples.c:235
#3 0x0000559a5439ca63 in ExecFetchSlotHeapTuple (slot=slot(at)entry=0x559a556264b8, materialize=materialize(at)entry=true, shouldFree=shouldFree(at)entry=0x7fff7f2d5787) at execTuples.c:1624
#4 0x0000559a541fe809 in heapam_tuple_update (relation=0x7f7a07efe100, otid=0x559a55626d88, slot=0x559a556264b8, cid=0, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7fff7f2d5800,
lockmode=0x7fff7f2d57fc, update_indexes=0x7fff7f2d5857) at heapam_handler.c:325
#5 0x0000559a5423fd55 in table_tuple_update (update_indexes=0x7fff7f2d5857, lockmode=0x7fff7f2d57fc, tmfd=0x7fff7f2d5800, wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>,
slot=0x559a556264b8, otid=0x559a55626d88, rel=0x7f7a07efe100) at ../../../../src/include/access/tableam.h:1261
#6 simple_table_tuple_update (rel=rel(at)entry=0x7f7a07efe100, otid=otid(at)entry=0x559a55626d88, slot=slot(at)entry=0x559a556264b8, snapshot=0x0, update_indexes=update_indexes(at)entry=0x7fff7f2d5857)
at tableam.c:325
#7 0x0000559a54397f08 in ExecSimpleRelationUpdate (estate=estate(at)entry=0x559a55678dc8, epqstate=epqstate(at)entry=0x7fff7f2d58b0, searchslot=searchslot(at)entry=0x559a55626d58,
slot=slot(at)entry=0x559a556264b8) at execReplication.c:493
#8 0x0000559a544b58b2 in apply_handle_update (s=s(at)entry=0x7fff7f2dce80) at worker.c:750
#9 0x0000559a544b5e62 in apply_dispatch (s=s(at)entry=0x7fff7f2dce80) at worker.c:968
#10 0x0000559a544b65ba in LogicalRepApplyLoop (last_received=11741462160) at worker.c:1176
#11 0x0000559a544b6d1a in ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
#12 0x0000559a5447d927 in StartBackgroundWorker () at bgworker.c:834
#13 0x0000559a5448cb6b in do_start_bgworker (rw=rw(at)entry=0x559a5558c010) at postmaster.c:5770
#14 0x0000559a5448ccf7 in maybe_start_bgworkers () at postmaster.c:5996
#15 0x0000559a5448d678 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007f7a1bf4ae2a in select () from /usr/lib/libc.so.6
#18 0x0000559a5448e0f8 in ServerLoop () at postmaster.c:1668
#19 0x0000559a5448f4fc in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1377
#20 0x0000559a543e988a in main (argc=3, argv=0x559a55534260) at main.c:228

So it perhaps looks like some memory management issue in the subscribing
postmaster, rather than wal corruption on the publisher.

> > > Can you show us \d+ on the videos table? What's the replica identity?
> >
> > Table "public.videos"
> > Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
> > -------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> > id | integer | | not null | nextval('videos_id_seq'::regclass) | plain | |
> > title | text | | not null | | extended | |
> > cover_image | bytea | | not null | | extended | |
> > metadata | jsonb | | not null | | extended | |
> > category | integer | | not null | | plain | |
> > published | date | | not null | | plain | |
> > added | timestamp without time zone | | not null | now() | plain | |
> > played | boolean | | not null | false | plain | |
> > Indexes:
> > "videos_pkey" PRIMARY KEY, btree (id)
> > Foreign-key constraints:
> > "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
> > Publications:
> > "pub"
> >
> > Table "public.videos"
> > Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
> > -------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> > id | integer | | not null | nextval('videos_id_seq'::regclass) | plain | |
> > title | text | | not null | | extended | |
> > cover_image | bytea | | not null | | extended | |
> > metadata | jsonb | | not null | | extended | |
> > category | integer | | not null | | plain | |
> > published | date | | not null | | plain | |
> > added | timestamp without time zone | | not null | now() | plain | |
> > played | boolean | | not null | false | plain | |
> > Indexes:
> > "videos_pkey" PRIMARY KEY, btree (id)
> > Foreign-key constraints:
> > "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
> > Access method: heap
> >
> >
> > Googling for relica identity, it looks like it should be shown in the \d+
> > output, but it isn't, for me. Maybe because it's has default value?
> >
>
> Yes, I think it should be listed either for the index, or there should
> be an extra section with replica identity. But that depends on psql
> version. Aren't you accidentally using an older psql version? That
> should be mentioned right after connecting to the server, I think.

Maybe it has something to do with my upgrade method. I dumped/restored
the replica with pg_dumpall, and then just proceded to enable subscription
and refresh publication with (copy_data=false) for all my subscriptions.

Now, I've tried to start fresh and replicate from the primary "hometv" database
into a new l5_hometv2 on my crashing 12.1 server and it replicates fine, no
crashes. Anyway even in this fresh replicated database, there's no REPLICA
IDENTITY in the \d+ output.

I don't have multiple postgresql installations on the host. There's only
a difference between primary version 11.5 (ARMv7) and replica 12.1 (x86_64).

regards,
o.

> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tomas Vondra 2019-11-21 16:09:06 Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Previous Message Tomas Vondra 2019-11-21 15:27:32 Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker