Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ondřej Jirman <megi(at)xff(dot)cz>
Cc: Ondřej Jirman <ienieghapheoghaiwida(at)xff(dot)cz>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 15:27:32
Message-ID: 20191121152732.hu6y4wzw7nkxpsvk@development
Lists: pgsql-bugs

On Thu, Nov 21, 2019 at 04:07:16PM +0100, Ondřej Jirman wrote:
>On Thu, Nov 21, 2019 at 03:35:03PM +0100, Tomas Vondra wrote:
>> On Thu, Nov 21, 2019 at 02:55:23PM +0100, Ondřej Jirman wrote:
>> > On Thu, Nov 21, 2019 at 02:32:37PM +0100, Tomas Vondra wrote:
>> > > On Thu, Nov 21, 2019 at 01:59:13PM +0100, Ondřej Jirman wrote:
>> > > > On Thu, Nov 21, 2019 at 12:53:26PM +0100, postgresql wrote:
>> > > > > Hello,
>> > > > >
>> > > > > On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
>> > > > > > On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
>> > > > > > > Replication of one of my databases (running on ARMv7 machine) started
>> > > > > > > segfaulting on the subscriber side (x86_64) like this:
>> > > > > > >
>> > > > > > > #0 0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
>> > > > > > > /usr/lib/libc.so.6
>> > > > > > > #1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
>> > > > > > > __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
>> > > > > > > #2 tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
>> > > > > > > #3 0x000055d033e94d32 in ExecFetchSlotHeapTuple
>> > > > > > > (slot=slot(at)entry=0x55d0356da3b8, materialize=materialize(at)entry=true,
>> > > > > > > shouldFree=shouldFree(at)entry=0x7fff0e7cf387) at execTuples.c:1624
>> > > > >
>> > > > > I forgot to add that publisher is still PostgreSQL 11.5.
>> > > > >
>> > > >
>> > > > I can also add that I have data checksumming enabled on both ends, and
>> > > > it did not detect any corruption:
>> > > >
>> > > > # pg_verify_checksums -D /var/lib/postgres/data
>> > > > Checksum scan completed
>> > > > Data checksum version: 1
>> > > > Files scanned: 1751
>> > > > Blocks scanned: 86592
>> > > > Bad checksums: 0
>> > > >
>> > > > # pg_checksums /var/lib/postgres/data
>> > > > Checksum operation completed
>> > > > Files scanned: 22777
>> > > > Blocks scanned: 3601527
>> > > > Bad checksums: 0
>> > > > Data checksum version: 1
>> > > >
>> > > > WAL log on the publisher is also dumpable to a state hours after the issues
>> > > > started:
>> > > >
>> > > > I've put the dump here, if it's of any use: https://megous.com/dl/tmp/wal_dump.txt
>> > > >
>> > > > Dump ends with:
>> > > >
>> > > > pg_waldump: FATAL: error in WAL record at 2/BBE0E538: invalid record length at 2/BBE0E5A8: wanted 24, got 0
>> > > >
>> > > > But that seems normal. I get that error on my other database clusters, too.
>> > > >
>> > > > I managed to extract the failing logical decoding data from the publisher, if
>> > > > that helps:
>> > > >
>> > > >
>> > > > SELECT * FROM pg_logical_slot_peek_binary_changes('l5_hometv', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub');
>> > > >
>> > > > 2/BBD86EA0 | 56395 | \x4200000002bbd880b800023acd790ce5510000dc4b
>> > > > 2/BBD87E90 | 56395 | \x5200004a687075626c696300766964656f73006400080169640000000017ffffffff007469746c650000000019ffffffff00636f7665725f696d6167650000000011ffffffff006d657461646174610000000edafffffff
>> > > > f0063617465676f72790000000017ffffffff007075626c6973686564000000043affffffff006164646564000000045affffffff00706c617965640000000010ffffffff
>> > > > 2/BBD87E90 | 56395 | \x5500004a684e0008740000000438333933740000005650617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c20627
>> > > > 96c6f206279206ec49b636f20c5a17061746ec49b7574000001397b226964223a20226430313064343430303965323131656162323539616331663662323230656538222c202264617465223a2022323031392d31312d3138222c20226e616d65223a20
>> > > > 2250617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c2062796c6f206279206ec49b636f20c5a17061746ec49b222c2022696d616765223a202
>> > > > 268747470733a2f2f63646e2e7873642e637a2f726573697a652f63353535656239633131353333313632386164666539396237343534353731655f657874726163743d302c302c313931392c313038305f726573697a653d3732302c3430355f2e6a70
>> > > > 673f686173683d6362316362623836336230353361613561333761346666616439303865303431227d7400000003323432740000000a323031392d31312d3138740000001a323031392d31312d31382031323a35303a30312e383136333736740000000
>> > > > 174
>> > > > 2/BBD880E8 | 56395 | \x430000000002bbd880b800000002bbd880e800023acd790ce551
>> > > >
>> > >
>> > > Can you show the whole transaction? From the WAL dump it seems it just
>> > > did a single UPDATE:
>> >
>> > I'm not sure what you mean. The above is the whole output I got from that
>> > SELECT.
>> >
>> > > rmgr: Heap len (rec/tot): 59/ 4075, tx: 56395, lsn: 2/BBD86EA0, prev 2/BBD86E68, desc: LOCK off 2: xid 56395: flags 0 LOCK_ONLY EXCL_LOCK
>> > > blkref #0: rel 1663/19030/19048 fork main blk 415 (FPW); hole: offset: 56, length: 4176
>> > > rmgr: Heap len (rec/tot): 523/ 523, tx: 56395, lsn: 2/BBD87E90, prev 2/BBD86EA0, desc: HOT_UPDATE off 2 xmax 56395 ; new off 9 xmax 0
>> > > blkref #0: rel 1663/19030/19048 fork main blk 415
>> > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 56395, lsn: 2/BBD880B8, prev 2/BBD87E90, desc: COMMIT 2019-11-20 23:28:47.016273 CET
>> > >
>> > > Which fields does it update?
>> >
>> > The only update that my system does in this database is:
>> >
>> > UPDATE videos SET played = TRUE WHERE id = ?
>> >
>> > The rest is just inserts:
>> >
>> > INSERT INTO categories (name, metadata, cover_image, provider) VALUES (?, ?, ?, (SELECT id FROM providers WHERE system_name = ?))
>> > INSERT INTO videos (title, metadata, cover_image, category, published) VALUES (?, ?, ?, ?, ?)
>> >
>> > There are no other data modification statements executed.
>> >
>>
>> Well, this is definitely the UPDATE - both judging by the WAL dump and
>> also because the backtrace contains apply_handle_update.
>
>I've also managed to get valgrind output on the segfaulting process:
>
>https://megous.com/dl/tmp/valgrind-105524-109020.log
>
>Though it probably doesn't show much new.
>

Is postgres actually compiled with valgrind support? What does pg_config
say?

If you're building the server locally (I don't think there are packages
with valgrind enabled), maybe try adding the --enable-cassert option and

CFLAGS="-DRANDOMIZE_ALLOCATED_MEMORY"

which will do more extensive checks at runtime.
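For reference, a rebuild along those lines might look like the sketch below. This assumes a from-source build; the paths and -j value are illustrative. Valgrind support shows up as -DUSE_VALGRIND in the compiler flags (it's a macro from pg_config_manual.h), so pg_config can tell you whether the running server has it.

```shell
# Check how the running server was built; valgrind support appears
# as -DUSE_VALGRIND in the reported flags.
pg_config --configure
pg_config --cflags

# Rebuild from source with assertions, memory randomization, and
# valgrind instrumentation enabled (illustrative flags).
./configure --enable-cassert \
    CFLAGS="-DRANDOMIZE_ALLOCATED_MEMORY -DUSE_VALGRIND -O0 -g"
make -j4
make install
```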

>> Can you show us \d+ on the videos table? What's the replica identity?
>
> Table "public.videos"
> Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
>-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> id | integer | | not null | nextval('videos_id_seq'::regclass) | plain | |
> title | text | | not null | | extended | |
> cover_image | bytea | | not null | | extended | |
> metadata | jsonb | | not null | | extended | |
> category | integer | | not null | | plain | |
> published | date | | not null | | plain | |
> added | timestamp without time zone | | not null | now() | plain | |
> played | boolean | | not null | false | plain | |
>Indexes:
> "videos_pkey" PRIMARY KEY, btree (id)
>Foreign-key constraints:
> "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>Publications:
> "pub"
>
> Table "public.videos"
> Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
>-------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
> id | integer | | not null | nextval('videos_id_seq'::regclass) | plain | |
> title | text | | not null | | extended | |
> cover_image | bytea | | not null | | extended | |
> metadata | jsonb | | not null | | extended | |
> category | integer | | not null | | plain | |
> published | date | | not null | | plain | |
> added | timestamp without time zone | | not null | now() | plain | |
> played | boolean | | not null | false | plain | |
>Indexes:
> "videos_pkey" PRIMARY KEY, btree (id)
>Foreign-key constraints:
> "videos_category_fkey" FOREIGN KEY (category) REFERENCES categories(id)
>Access method: heap
>
>
>Googling for replica identity, it looks like it should be shown in the \d+
>output, but it isn't for me. Maybe that's because it has the default value?
>

Yes, I think it should be listed either for the index, or there should
be an extra section with the replica identity. But that depends on the
psql version. Are you perhaps using an older psql version? A version
mismatch should be mentioned right after connecting to the server, I
think.
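For what it's worth, the replica identity can also be checked directly in the catalog, independent of the psql version, via the relreplident column of pg_class (the table name here matches your schema):

```sql
-- relreplident: 'd' = default (primary key), 'n' = nothing,
--               'f' = full, 'i' = index
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'videos';
```

With the default identity and the "videos_pkey" primary key, this should return 'd'.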

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
