Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: BUG #17716: walsender process hang while decoding 'DROP PUBLICATION' XLOG
Date: 2022-12-19 11:27:29
Message-ID: CAJpy0uAnQHf05tGJ0jBtFz+L3R9vXmynT5kMSp+qYiOo=J=14g@mail.gmail.com
Thread:
Lists: pgsql-bugs

Hello,
The problem does not reproduce on HEAD but does reproduce on 13.9.

*Problem Description*:
On dropping a large publication with 10,000 tables added to it, the
walsender hangs in
ReorderBufferCommit
--> ReorderBufferExecuteInvalidations --> ... --> hash_seq_search()

*Root Cause:*
Dropping the publication deletes 10,000 tuples (table entries) from the
catalogs, so it creates invalidation messages to invalidate various cache
entries. Each table-entry deletion results in 4 invalidation messages, so
the complete DROP PUBLICATION operation adds up to approximately 40,000
invalidation messages (plus a few more for the drop of the publication
itself). These invalidation messages are all attached to the top-level
transaction.

Later, when the walsender reads the XLOG of the DROP PUBLICATION operation,
it replays these invalidation messages on each command-id increment. The
command-id is incremented from 1 to 10,000, so for each command-id it
replays all 40,000 invalidation messages. There is no way to identify that
only 4 invalidation messages belong to a particular command-id, so all
accumulated messages are replayed on every command-id increment. This
results in a hang-like situation in the above stack, and thus an insert
(belonging to a different publication) executed immediately after the DROP
PUBLICATION is not received on the subscriber.
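To make the cost concrete, here is a toy simulation of the 13.x replay
loop (the constants 10,000 and 4 come from the description above; this is
a model of the loop's shape, not PostgreSQL code):

```python
# Toy model (NOT PostgreSQL code) of the pre-fix behavior: all of the
# transaction's invalidation messages sit in one accumulated list on the
# top-level transaction, and the whole list is executed on every
# command-id increment.

N_TABLES = 10_000        # tables in the dropped publication
MSGS_PER_TABLE = 4       # invalidation messages per table-entry deletion

def replay_pre_fix(n_tables, msgs_per_table):
    """Total invalidation executions when the full accumulated message
    list is replayed on each command-id increment."""
    total_msgs = n_tables * msgs_per_table      # 40,000 messages overall
    executions = 0
    for _command_id in range(1, n_tables + 1):  # command-ids 1..10,000
        executions += total_msgs                # replay everything, again
    return executions

print(replay_pre_fix(N_TABLES, MSGS_PER_TABLE))  # 400,000,000 executions
```

The cost is quadratic in the number of dropped tables, which is why the
walsender appears to hang inside hash_seq_search() rather than crash.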

*What fixes it on HEAD:*
On later versions, a new mechanism was introduced wherein, after each
command-id increment, we log the invalidations to be done for that
command-id: a record of type XLOG_XACT_INVALIDATIONS is inserted into the
XLOG. For the case at hand, each such invalidation record carries the 4
invalidation messages for its command-id. So all 40,000 invalidation
messages are still inserted into the XLOG, but in pieces: 10,000
invalidation records, each representing the invalidations for one
particular command-id and each carrying 4 invalidation messages.
Later, when the walsender replays, it replays only that particular
command-id's XLOG record after each command-id increment (in the order the
records were inserted). Thus we perform invalidations 10,000 times, but
for 4 messages at a time, instead of replaying 40,000 messages every time
as in the previous version.
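The same toy model applied to the HEAD behavior (again a sketch of the
loop's shape using the counts from the description, not PostgreSQL code)
shows the replay cost becoming linear:

```python
# Toy model (NOT PostgreSQL code) of the post-fix behavior: each
# command-id has its own XLOG_XACT_INVALIDATIONS record carrying only
# its 4 messages, and replay executes just that record's messages.

N_TABLES = 10_000
MSGS_PER_TABLE = 4

def replay_post_fix(n_tables, msgs_per_table):
    """Total invalidation executions when each command-id replays only
    the messages logged for that command-id."""
    executions = 0
    for _command_id in range(1, n_tables + 1):
        executions += msgs_per_table   # only this command-id's 4 messages
    return executions

print(replay_post_fix(N_TABLES, MSGS_PER_TABLE))  # 40,000 executions
```

That is 40,000 executions instead of 400,000,000, a 10,000x reduction for
this workload.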

These changes were committed via the commit IDs below:

*Full change:*
commit c55040ccd017962b7b8d3fbcdc184aa90c722a21
Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Date: Thu Jul 23 08:19:07 2020 +0530

*Trivial Fix:*
commit 36e545cd051ffe58064c1f38faeb64f4feef4aec
Author: Amit Kapila
Date: Thu Nov 10 16:56:49 2022 +0530

I have not tried it yet on version 14; hopefully we will see it working
there as well. I will try it and update soon.

thanks
Shveta

On Tue, Dec 13, 2022 at 5:43 PM Bowen Shi <zxwsbg12138(at)gmail(dot)com> wrote:

> The following steps can help reproduce the problem:
> 1. Run "initinstance.sh" to get two instances, one publisher and one
> subscriber; this script creates a subscription as well;
> 2. Run "createschema.sh" to create 1000 schemas, each schema having 100
> tables;
> 3. Run "createpublication.sh" to create a huge publication, which
> publishes 10,000 tables;
> 4. Run "insert.sh" to insert exactly one tuple into each table
> created in step 2;
> 5. Run "insert into t values(1);" on the publisher; we can then find
> the row on the subscriber;
> 6. Run "drop publication pub_t_large;" on the publisher;
> 7. Run "insert into t values(1);" on the publisher; this time we cannot
> find the row on the subscriber, because the walsender has hung while
> decoding step 6's XLOG;
>
>
>
> PG Bug reporting form <noreply(at)postgresql(dot)org> wrote on Tue, Dec 13, 2022 at 15:51:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference: 17716
> > Logged by: bowenshi
> > Email address: zxwsbg12138(at)gmail(dot)com
> > PostgreSQL version: 13.9
> > Operating system: centos
> > Description:
> >
> > Dears, I have found a problem while using logical decoding. I have two
> > publications, and the walsender process hangs while decoding the DROP
> PUBLICATION
> > XLOG of another publication. The pstack looks like the following:
> >
> > #0 0x00000000008b9c24 in hash_seq_search ()
> > #1 0x00007fd14340590c in rel_sync_cache_publication_cb () from
> > pgoutput.so
> > #2 0x0000000000897a1a in CallSyscacheCallbacks ()
> > #3 0x000000000073682e in ReorderBufferCommit ()
> > #4 0x000000000072b988 in DecodeXactOp ()
> > #5 0x000000000072bd52 in LogicalDecodingProcessRecord ()
> > #6 0x000000000074d14f in XLogSendLogical ()
> > #7 0x000000000074f562 in WalSndLoop ()
> > #8 0x0000000000750322 in exec_replication_command ()
> > #9 0x0000000000792301 in PostgresMain ()
> > #10 0x000000000048854f in ServerLoop ()
> > #11 0x000000000071b579 in PostmasterMain ()
> > #12 0x000000000048948d in main ()
> >
> > When I used perf to analyze it, almost 99% of the time was spent in the
> `hash_seq_search`
> > function.
> >
> > This problem can be reproduced, and I will send the shell scripts later.
> >
>
