RE: Forget close an open relation in ReorderBufferProcessTXN()

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Langote' <amitlangote09(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Japin Li <japinli(at)hotmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Forget close an open relation in ReorderBufferProcessTXN()
Date: 2021-05-21 06:55:10
Message-ID: OSBPR01MB4888DCAC385E707D74B2957EED299@OSBPR01MB4888.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> Here are updated/divided patches.
Thanks for your updates.

But I've detected segmentation faults caused by the patch,
which can happen during 100_bugs.pl in src/test/subscription.
The failure occurs more than once in every ten runs.

This problem seems to be a timing issue and was already introduced by v3.
I applied v5 on top of HEAD and reproduced the failure, while
unpatched HEAD does not reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I used the same commit id 4f586fe2 for all checks. The logs below are from v3.

* The message of the failure during TAP test.

# Postmaster PID for node "twoways" is 5015
Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways
# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# psql: error: connection to server on socket "/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory
# Is the server running locally and accepting connections on that socket?
timed out waiting for catchup at t/100_bugs.pl line 148.

The failure produces a core file; its backtrace is below.
My first guess at the cause: between the time an entry is obtained from hash_search() in get_rel_sync_entry()
and the time its map is set by convert_tuples_by_name() in maybe_send_schema(), an invalidation
message arrives, and the callback tries to free the not-yet-initialized descriptors in the entry
(see the sketch after the backtrace).

* core file backtrace

Core was generated by `postgres: twoways: walsender k5user [local] START_REPLICATION '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
1225 FreeTupleDesc(entry->map->indesc);
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb) bt
#0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
#1 0x0000000000ae21f0 in LocalExecuteInvalidationMessage (msg=0x21d1de8) at inval.c:601
#2 0x00000000008dbd6e in ReorderBufferExecuteInvalidations (nmsgs=4, msgs=0x21d1db8) at reorderbuffer.c:3232
#3 0x00000000008da70a in ReorderBufferProcessTXN (rb=0x21d1a40, txn=0x2210b58, commit_lsn=25569096, snapshot_now=0x21d1e10, command_id=1, streaming=false)
at reorderbuffer.c:2294
#4 0x00000000008dae56 in ReorderBufferReplay (txn=0x2210b58, rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619,
origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#5 0x00000000008daede in ReorderBufferCommit (rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0,
origin_lsn=0) at reorderbuffer.c:2615
#6 0x00000000008cae06 in DecodeCommit (ctx=0x21e6880, buf=0x7fffb9383db0, parsed=0x7fffb9383c10, xid=748, two_phase=false) at decode.c:744
#7 0x00000000008ca1ed in DecodeXactOp (ctx=0x21e6880, buf=0x7fffb9383db0) at decode.c:278
#8 0x00000000008c9e76 in LogicalDecodingProcessRecord (ctx=0x21e6880, record=0x21e6c80) at decode.c:142
#9 0x0000000000901fcc in XLogSendLogical () at walsender.c:2876
#10 0x0000000000901327 in WalSndLoop (send_data=0x901f30 <XLogSendLogical>) at walsender.c:2306
#11 0x00000000008ffd5f in StartLogicalReplication (cmd=0x219aff8) at walsender.c:1206
#12 0x00000000009006ae in exec_replication_command (
cmd_string=0x2123c20 "START_REPLICATION SLOT \"pg_16400_sync_16392_6964617299612181363\" LOGICAL 0/182D058 (proto_version '2', publication_names '\"testpub\"')") at walsender.c:1646
#13 0x000000000096ffae in PostgresMain (argc=1, argv=0x7fffb93840d0, dbname=0x214ef18 "d1", username=0x214eef8 "k5user") at postgres.c:4482
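
* suspected sequence (illustrative sketch)

To make the guess concrete, below is a self-contained C sketch of the pattern I suspect. This is not actual pgoutput.c code; the struct and function names only mimic the ones in the backtrace, and the memset() merely simulates dynahash handing back reused, non-zeroed memory:

#include <stdlib.h>
#include <string.h>

typedef struct TupleDesc { int natts; } TupleDesc;

typedef struct AttrMap
{
    TupleDesc  *indesc;
    TupleDesc  *outdesc;
} AttrMap;

typedef struct RelationSyncEntry
{
    unsigned int relid;
    AttrMap    *map;            /* set lazily, as by maybe_send_schema() */
} RelationSyncEntry;

/* Mimics hash_search(..., HASH_ENTER, ...): new entries are not zeroed. */
static RelationSyncEntry *
cache_enter(unsigned int relid)
{
    RelationSyncEntry *entry = malloc(sizeof(RelationSyncEntry));

    memset(entry, 0x7f, sizeof(RelationSyncEntry)); /* simulate reused memory */
    entry->relid = relid;
    /* note: entry->map is left uninitialized, holding garbage */
    return entry;
}

/* Mimics rel_sync_cache_relation_cb() freeing the map's descriptors. */
static void
invalidation_cb(RelationSyncEntry *entry)
{
    if (entry->map != NULL)     /* garbage is rarely NULL, so this passes */
    {
        free(entry->map->indesc);   /* crashes here, like pgoutput.c:1225 */
        free(entry->map->outdesc);
        free(entry->map);
        entry->map = NULL;
    }
}

int
main(void)
{
    RelationSyncEntry *entry = cache_enter(16388);

    /* An invalidation arrives before the map has ever been set up: */
    invalidation_cb(entry);     /* SIGSEGV, as in the core above */

    return 0;
}

If this guess is right, resetting entry->map (and any other lazily-filled fields) to NULL when the entry is first entered into the hash table should close the window, though I haven't confirmed that yet.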

I'll update when I get more information.

Best Regards,
Takamichi Osumi
