Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time

From: 巨鲸 <13207961(at)qq(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time
Date: 2022-08-15 02:27:26
Message-ID: tencent_9E39E51A04CA7F2A1837F5D26734A79A2605@qq.com
Lists: pgsql-bugs

Hi Masahiko,
I think maybe you should use filter-tables to filter the test table, like:
       filter-tables="public.test666"

Then the walsender will not reach WalSndWriteData and will instead loop for a long time inside ReorderBufferCommit, so the bug may be reproducible.
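
That is, start the subscription as in step 2.2 of the original report, keeping the filter option (the connection options below are just the ones from that report; adjust them to your environment):

    pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot wal2json_pg_test02 \
        --start -o pretty-print=1 -o format-version=2 -o filter-tables="public.test666" -f -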

Regards,

--
Whale Song

------------------ Original ------------------
From: "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>
Send time: Friday, Aug 12, 2022 11:28 AM
To: "巨鲸" <13207961(at)qq(dot)com>; "pgsql-bugs" <pgsql-bugs(at)lists(dot)postgresql(dot)org>

Subject: Re: BUG #17580: use pg_terminate_backend to terminate a wal sender process may wait a long time

On Wed, Aug 10, 2022 at 5:50 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17580
> Logged by:          whale song
> Email address:      13207961(at)qq(dot)com
> PostgreSQL version: 10.5
> Operating system:   centos
> Description:
>
> 1) Create a table for test
> create table test666(id int primary key, number int);
>
> 2) Use pg_recvlogical to subscribe
> 2.1) Create a replication slot
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot
> wal2json_pg_test02 --create-slot -P wal2json
> 2.2) Start to subscribe, and use filter-tables to filter the test table
> test666
> pg_recvlogical -h 9.223.28.176 -p 11002 -d test -U tbase --slot
> wal2json_pg_test02 --start -o pretty-print=1 -o format-version=2 -o
> filter-tables="public.test666" -f -
>
> 3) Insert a lot of data to the table in one transaction
> insert into test666 values (generate_series(1000000001,2000000000),
> generate_series(1000000001,2000000000));
> It may take a long time, perhaps 50 minutes.
>
> 4) Use pg_replication_slots to get the wal sender process pid (active_pid)
> select *, pg_current_wal_lsn() from pg_replication_slots;
>
> 5) When step 3 finishes, use pstack to watch the wal sender process stack; while it
> is looping in ReorderBufferCommit, kill the pg_recvlogical started in step
> 2.2
>  pstack 80743
> #0 0x0000000000c400d9 in ResourceArrayAdd (resarr=0x1576970,
> value=139941599240896) at resowner.c:359
> #1 0x0000000000c3fef4 in ResourceArrayEnlarge (resarr=0x1576970) at
> resowner.c:252
> #2 0x0000000000c411f4 in ResourceOwnerEnlargeRelationRefs (owner=0x15768f0)
> at resowner.c:1129
> #3 0x0000000000be0899 in RelationIncrementReferenceCount
> (rel=0x7f46b14fe2c0) at relcache.c:2589
> #4 0x0000000000be05da in RelationIdGetRelation (relationId=1177785) at
> relcache.c:2451
> #5 0x00007f4e0d649b9e in pg_decode_write_tuple (ctx=0x13fc5a8,
> relation=0x7f46b14fcf38, tuple=0x315b910, kind=PGOUTPUTJSON_IDENTITY) at
> wal2json.c:1503
> #6 0x00007f4e0d64a514 in pg_decode_write_change (ctx=0x13fc5a8,
> txn=0x15957a8, relation=0x7f46b14fcf38, change=0x33bc9f8) at
> wal2json.c:1747
> #7 0x00007f4e0d64a9db in pg_decode_change_v2 (ctx=0x13fc5a8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1865
> #8 0x00007f4e0d648b2b in pg_decode_change (ctx=0x13fc5a8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at wal2json.c:1098
> #9 0x00000000009cf3d1 in change_cb_wrapper (cache=0x1410fb8, txn=0x15957a8,
> relation=0x7f46b14fcf38, change=0x33bc9f8) at logical.c:730
> #10 0x00000000009d75df in ReorderBufferCommit (rb=0x1410fb8, xid=174028801,
> commit_lsn=44579135148872, end_lsn=44579135148936,
> commit_time=712222669188947, origin_id=0, origin_lsn=0) at
> reorderbuffer.c:1445
> #11 0x00000000009cb15b in DecodeCommit (ctx=0x13fc5a8, buf=0x7fff2574d600,
> parsed=0x7fff2574d530, xid=174028801) at decode.c:713
> #12 0x00000000009ca76b in DecodeXactOp (ctx=0x13fc5a8, buf=0x7fff2574d600)
> at decode.c:335
> #13 0x00000000009ca401 in LogicalDecodingProcessRecord (ctx=0x13fc5a8,
> record=0x13fc880) at decode.c:175
> #14 0x00000000009ead45 in XLogSendLogical () at walsender.c:3152
> #15 0x00000000009ea09f in WalSndLoop (send_data=0x9eaca9 <XLogSendLogical>)
> at walsender.c:2520
> #16 0x00000000009e8aed in StartLogicalReplication (cmd=0x151bb48) at
> walsender.c:1445
> #17 0x00000000009e944e in exec_replication_command (cmd_string=0x1520408
> "START_REPLICATION SLOT subs_f524p217b7_dn0001 LOGICAL 285C/F8FFEA50 (
> \"format-version\" '2' )") at walsender.c:1887
> #18 0x0000000000a71ed8 in PostgresMain (argc=1, argv=0x1419f58,
> dbname=0x1419f28 "prd_jxcard_lingwang", username=0x1419f08 "mingchen_read")
> at postgres.c:6118
> #19 0x00000000009b0faa in BackendRun (port=0x14135b0) at postmaster.c:5071
> #20 0x00000000009b0753 in BackendStartup (port=0x14135b0) at
> postmaster.c:4743
> #21 0x00000000009ac3ae in ServerLoop () at postmaster.c:1995
> #22 0x00000000009aba35 in PostmasterMain (argc=5, argv=0x13dc770) at
> postmaster.c:1603
> #23 0x0000000000878c44 in main (argc=5, argv=0x13dc770) at main.c:233
>
> 6) Use pg_replication_slots to query the replication slot information; the
> active_pid is still active. Use pg_terminate_backend to terminate
> active_pid; it returns true, but the active_pid is still active.
>
> 7) The active_pid wal sender process only exits after waiting for a long time.
>

Hmm, I could not reproduce this behavior in my environment. As we have
CHECK_FOR_INTERRUPTS in WalSndWriteData() and the walsender exits if
it fails to flush, I think the walsender process should exit if it
either receives SIGTERM or fails to send the data. Did you see the
log "LOG:  could not send data to client: Broken pipe" after
terminating pg_recvlogical?
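
To be clear about what I mean, the relevant part of WalSndWriteData() in
src/backend/replication/walsender.c looks roughly like this (a simplified
sketch, not the exact source):

    /* output previously gathered data in a CopyData packet */
    pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

    /* react to a pending SIGTERM, e.g. from pg_terminate_backend() */
    CHECK_FOR_INTERRUPTS();

    /* try to flush pending output; give up if the client is gone */
    if (pq_flush_if_writable() != 0)
        WalSndShutdown();

CHECK_FOR_INTERRUPTS() is what lets the walsender react to a termination
request, and WalSndShutdown() makes it exit once flushing to the client fails.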

If it's reproducible in your environment, could you get the stack
trace of the walsender process that keeps running? Also, please try with
the latest version, 10.22.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
