Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker

From: Ondřej Jirman <ienieghapheoghaiwida@xff.cz>
To: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date: 2019-11-21 13:55:23
Message-ID: 20191121135523.aybyo7o6yzxxungd@core.my.home

On Thu, Nov 21, 2019 at 02:32:37PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 01:59:13PM +0100, Ondřej Jirman wrote:
> > On Thu, Nov 21, 2019 at 12:53:26PM +0100, postgresql wrote:
> > > Hello,
> > >
> > > On Thu, Nov 21, 2019 at 11:39:40AM +0100, Tomas Vondra wrote:
> > > > On Thu, Nov 21, 2019 at 01:14:18AM +0000, PG Bug reporting form wrote:
> > > > > Replication of one of my databases (running on an ARMv7 machine) started
> > > > > segfaulting on the subscriber side (x86_64) like this:
> > > > >
> > > > > #0 0x00007fc259739917 in __memmove_sse2_unaligned_erms () from
> > > > > /usr/lib/libc.so.6
> > > > > #1 0x000055d033e93d44 in memcpy (__len=620701425, __src=<optimized out>,
> > > > > __dest=0x55d0356da804) at /usr/include/bits/string_fortified.h:34
> > > > > #2 tts_virtual_materialize (slot=0x55d0356da3b8) at execTuples.c:235
> > > > > #3 0x000055d033e94d32 in ExecFetchSlotHeapTuple
> > > > > (slot=slot@entry=0x55d0356da3b8, materialize=materialize@entry=true,
> > > > > shouldFree=shouldFree@entry=0x7fff0e7cf387) at execTuples.c:1624
> > >
> > > I forgot to add that the publisher is still PostgreSQL 11.5.
> > >
> >
> > I can also add that I have data checksumming enabled on both ends, and
> > it did not detect any corruption:
> >
> > # pg_verify_checksums -D /var/lib/postgres/data
> > Checksum scan completed
> > Data checksum version: 1
> > Files scanned: 1751
> > Blocks scanned: 86592
> > Bad checksums: 0
> >
> > # pg_checksums /var/lib/postgres/data
> > Checksum operation completed
> > Files scanned: 22777
> > Blocks scanned: 3601527
> > Bad checksums: 0
> > Data checksum version: 1
> >
> > The WAL on the publisher is also dumpable up to a point hours after the issue
> > started:
> >
> > I've put the dump here, if it's of any use: https://megous.com/dl/tmp/wal_dump.txt
> >
> > Dump ends with:
> >
> > pg_waldump: FATAL: error in WAL record at 2/BBE0E538: invalid record length at 2/BBE0E5A8: wanted 24, got 0
> >
> > But that seems normal. I get that error on my other database clusters, too.
> >
> > I managed to extract the failing logical decoding data from the publisher, if
> > that helps:
> >
> >
> > SELECT * FROM pg_logical_slot_peek_binary_changes('l5_hometv', NULL, NULL, 'proto_version', '1', 'publication_names', 'pub');
> >
> > 2/BBD86EA0 | 56395 | \x4200000002bbd880b800023acd790ce5510000dc4b
> > 2/BBD87E90 | 56395 | \x5200004a687075626c696300766964656f73006400080169640000000017ffffffff007469746c650000000019ffffffff00636f7665725f696d6167650000000011ffffffff006d657461646174610000000edaffffffff0063617465676f72790000000017ffffffff007075626c6973686564000000043affffffff006164646564000000045affffffff00706c617965640000000010ffffffff
> > 2/BBD87E90 | 56395 | \x5500004a684e0008740000000438333933740000005650617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c2062796c6f206279206ec49b636f20c5a17061746ec49b7574000001397b226964223a20226430313064343430303965323131656162323539616331663662323230656538222c202264617465223a2022323031392d31312d3138222c20226e616d65223a202250617a6465726b613a204f206dc3a964696120736520706f7665646520626f6a2e204b64796279206ec3a17320706f6c6974696369206d696c6f76616c692c2062796c6f206279206ec49b636f20c5a17061746ec49b222c2022696d616765223a2022687474 70733a2f2f63646e2e7873642e637a2f726573697a652f63353535656239633131353333313632386164666539396237343534353731655f657874726163743d302c302c313931392c313038305f726573697a653d3732302c3430355f2e6a70673f686173683d6362316362623836336230353361613561333761346666616439303865303431227d7400000003323432740000000a323031392d31312d3138740000001a323031392d31312d31382031323a35303a30312e383136333736740000000174
> > 2/BBD880E8 | 56395 | \x430000000002bbd880b800000002bbd880e800023acd790ce551
> >
>
> Can you show the whole transaction? From the WAL dump it seems it just
> did a single UPDATE:

I'm not sure what you mean. The above is the whole output I got from that
SELECT.
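
For reference, the first byte of each record above is the pgoutput message type
('B' = Begin, 'R' = Relation, 'U' = Update, 'C' = Commit), so this really is a
single BEGIN/RELATION/UPDATE/COMMIT sequence. If it's useful, the type bytes can
be listed directly with something like this (a minimal sketch against the same
slot and publication as above):

SELECT lsn, xid, chr(get_byte(data, 0)) AS msg_type
FROM pg_logical_slot_peek_binary_changes('l5_hometv', NULL, NULL,
    'proto_version', '1', 'publication_names', 'pub');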

> rmgr: Heap len (rec/tot): 59/ 4075, tx: 56395, lsn: 2/BBD86EA0, prev 2/BBD86E68, desc: LOCK off 2: xid 56395: flags 0 LOCK_ONLY EXCL_LOCK
> blkref #0: rel 1663/19030/19048 fork main blk 415 (FPW); hole: offset: 56, length: 4176
> rmgr: Heap len (rec/tot): 523/ 523, tx: 56395, lsn: 2/BBD87E90, prev 2/BBD86EA0, desc: HOT_UPDATE off 2 xmax 56395 ; new off 9 xmax 0
> blkref #0: rel 1663/19030/19048 fork main blk 415
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 56395, lsn: 2/BBD880B8, prev 2/BBD87E90, desc: COMMIT 2019-11-20 23:28:47.016273 CET
>
> Which fields does it update?

The only update that my system does in this database is:

UPDATE videos SET played = TRUE WHERE id = ?

The rest is just inserts:

INSERT INTO categories (name, metadata, cover_image, provider) VALUES (?, ?, ?, (SELECT id FROM providers WHERE system_name = ?))
INSERT INTO videos (title, metadata, cover_image, category, published) VALUES (?, ?, ?, ?, ?)

There are no other data modification statements executed.

There may be updates in other databases in the same cluster, but the replication
worker is segfaulting only on the "hometv" database.
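
In case it helps narrow things down, the subscription-to-database mapping can be
checked on the subscriber with a query along these lines (a quick sketch):

SELECT d.datname, s.subname
FROM pg_subscription s
JOIN pg_database d ON d.oid = s.subdbid;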

regards,
o.

> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
