logical decoding bug: segfault in ReorderBufferToastReplace()

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: logical decoding bug: segfault in ReorderBufferToastReplace()
Date: 2019-12-05 01:36:16
Message-ID: b1da9abc-bad7-efd6-2f1c-503d508c3877@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-committers pgsql-hackers

On 9/8/19 14:01, Tom Lane wrote:
> Fix RelationIdGetRelation calls that weren't bothering with error checks.
>
> ...
>
> Details
> -------
> https://git.postgresql.org/pg/commitdiff/69f883fef14a3fc5849126799278abcc43f40f56

We had two different databases this week (with the same schema) both
independently hit the condition of this recent commit from Tom. It's on
11.5 so we're actually segfaulting and restarting rather than just
causing the walsender process to ERROR, but regardless there's still
some underlying bug here.

We have core files and we're still working to see if we can figure out
what's going on, but I thought I'd report now in case anyone has extra
ideas or suggestions. The segfault is on line 3034 of reorderbuffer.c.

https://github.com/postgres/postgres/blob/REL_11_5/src/backend/replication/logical/reorderbuffer.c#L3034

3033 toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
3034 toast_desc = RelationGetDescr(toast_rel);

We'll keep looking; let me know any feedback! Would love to track down
whatever bug is in the logical decoding code, if that's what it is.

==========

backtrace showing the call stack...

Core was generated by `postgres: walsender <NAME-REDACTED>
<DNS-REDACTED>(31712)'.
Program terminated with signal 11, Segmentation fault.
#0 ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78,
relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938)
at reorderbuffer.c:3034
3034 reorderbuffer.c: No such file or directory.
...
(gdb) #0 ReorderBufferToastReplace (rb=0x3086af0, txn=0x3094a78,
relation=0x2b79177249c8, relation=0x2b79177249c8, change=0x30ac938)
at reorderbuffer.c:3034
#1 ReorderBufferCommit (rb=0x3086af0, xid=xid(at)entry=1358809,
commit_lsn=9430473346032, end_lsn=<optimized out>,
commit_time=commit_time(at)entry=628712466364268,
origin_id=origin_id(at)entry=0, origin_lsn=origin_lsn(at)entry=0) at
reorderbuffer.c:1584
#2 0x0000000000716248 in DecodeCommit (xid=1358809,
parsed=0x7ffc4ce123f0, buf=0x7ffc4ce125b0, ctx=0x3068f70) at decode.c:637
#3 DecodeXactOp (ctx=0x3068f70, buf=buf(at)entry=0x7ffc4ce125b0) at
decode.c:245
#4 0x000000000071655a in LogicalDecodingProcessRecord (ctx=0x3068f70,
record=0x3069208) at decode.c:117
#5 0x0000000000727150 in XLogSendLogical () at walsender.c:2886
#6 0x0000000000729192 in WalSndLoop (send_data=send_data(at)entry=0x7270f0
<XLogSendLogical>) at walsender.c:2249
#7 0x0000000000729f91 in StartLogicalReplication (cmd=0x30485a0) at
walsender.c:1111
#8 exec_replication_command (
cmd_string=cmd_string(at)entry=0x2f968b0 "START_REPLICATION SLOT
\"<NAME-REDACTED>\" LOGICAL 893/38002B98 (proto_version '1',
publication_names '\"<NAME-REDACTED>\"')") at walsender.c:1628
#9 0x000000000076e939 in PostgresMain (argc=<optimized out>,
argv=argv(at)entry=0x2fea168, dbname=0x2fea020 "<NAME-REDACTED>",
username=<optimized out>) at postgres.c:4182
#10 0x00000000004bdcb5 in BackendRun (port=0x2fdec50) at postmaster.c:4410
#11 BackendStartup (port=0x2fdec50) at postmaster.c:4082
#12 ServerLoop () at postmaster.c:1759
#13 0x00000000007062f9 in PostmasterMain (argc=argc(at)entry=7,
argv=argv(at)entry=0x2f92540) at postmaster.c:1432
#14 0x00000000004be73b in main (argc=7, argv=0x2f92540) at main.c:228

==========

Some additional context...

# select * from pg_publication_rel;
prpubid | prrelid
---------+---------
71417 | 16453
71417 | 54949
(2 rows)

(gdb) print toast_rel
$4 = (struct RelationData *) 0x0

(gdb) print *relation->rd_rel
$11 = {relname = {data = "<NAME-REDACTED>", '\000' <repeats 44 times>},
relnamespace = 16402, reltype = 16430, reloftype = 0,
relowner = 16393, relam = 0, relfilenode = 16428, reltablespace = 0,
relpages = 0, reltuples = 0, relallvisible = 0, reltoastrelid = 0,
relhasindex = true, relisshared = false, relpersistence = 112 'p',
relkind = 114 'r', relnatts = 4, relchecks = 0, relhasoids = false,
relhasrules = false, relhastriggers = false, relhassubclass = false,
relrowsecurity = false, relforcerowsecurity = false,
relispopulated = true, relreplident = 100 'd', relispartition = false,
relrewrite = 0, relfrozenxid = 1808, relminmxid = 1}

(gdb) print *relation
$12 = {rd_node = {spcNode = 1663, dbNode = 16401, relNode = 16428},
rd_smgr = 0x0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false,
rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000',
rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0,
rd_rel = 0x2b7917724bd8, rd_att = 0x2b7917724ce8, rd_id = 16428,
rd_lockInfo = {lockRelId = {relId = 16428, dbId = 16401}},
rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0,
rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0,
rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0,
rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0,
rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0,
rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0,
rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0,
rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0,
rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support =
0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0,
rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats =
0x0, rd_amcache = 0x0, rd_indcollation = 0x0,
rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0,
rd_partcheckvalid = false, rd_partcheckcxt = 0x0}

(gdb) print *desc
$13 = {natts = 4, tdtypeid = 16430, tdtypmod = -1, tdhasoid = false,
tdrefcount = 1, constr = 0x2b7917724ef8, attrs = 0x2b7917724d08}

(gdb) print *txn
$2 = {xid = 1358809, has_catalog_changes = true, is_known_as_subxact =
false, toplevel_xid = 0, first_lsn = 9430473113640,
final_lsn = 9430473346032, end_lsn = 9430473350592,
restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0,
commit_time = 628712466364268, base_snapshot = 0x308cdc0,
base_snapshot_lsn = 9430473113776, base_snapshot_node = {prev = 0x3086b08,
next = 0x3086b08}, nentries = 357, nentries_mem = 357, serialized =
false, changes = {head = {prev = 0x30aca08, next = 0x309aac8}},
tuplecids = {head = {prev = 0x30ac878, next = 0x309ab18}}, ntuplecids
= 151, tuplecid_hash = 0x30b0bf0, toast_hash = 0x30bb460,
subtxns = {head = {prev = 0x3094b30, next = 0x3094b30}}, nsubtxns = 0,
ninvalidations = 278, invalidations = 0x30acb08, node = {
prev = 0x3086af8, next = 0x3086af8}}

(gdb) print *change
$1 = {lsn = 9430473343416, action = REORDER_BUFFER_CHANGE_INSERT,
origin_id = 0, data = {tp = {relnode = {spcNode = 1663, dbNode = 16401,
relNode = 16428}, clear_toast_afterwards = true, oldtuple = 0x0,
newtuple = 0x2b79313f9c68}, truncate = {nrelids = 70441758623359,
cascade = 44, restart_seqs = 64, relids = 0x0}, msg = {prefix =
0x40110000067f <Address 0x40110000067f out of bounds>,
message_size = 4294983724, message = 0x0}, snapshot =
0x40110000067f, command_id = 1663, tuplecid = {node = {spcNode = 1663,
dbNode = 16401, relNode = 16428}, tid = {ip_blkid = {bi_hi = 1,
bi_lo = 0}, ip_posid = 0}, cmin = 0, cmax = 826252392,
combocid = 11129}}, node = {prev = 0x30ac918, next = 0x30ac9b8}}

--
Jeremy Schneider
Database Engineer
Amazon Web Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Petr Fedorov 2019-12-05 07:24:16 A row-level trigger on a partitioned table is not created on a sub-partition created later
Previous Message Tom Lane 2019-12-05 01:30:08 Re: BUG #16140: View with INSERT, DO INSTEAD, and ON CONFLICT causes an error

Browse pgsql-committers by date

  From Date Subject
Next Message Tomas Vondra 2019-12-05 13:38:36 Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Previous Message Michael Paquier 2019-12-05 00:41:15 Re: pgsql: Blind attempt at fixing ecpg/compatlib's build

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-12-05 01:48:23 Re: Update minimum SSL version
Previous Message Kyotaro Horiguchi 2019-12-05 01:28:03 Re: could not stat promote trigger file leads to shutdown