RE: Logical replication timeout problem

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: RE: Logical replication timeout problem
Date: 2023-01-23 10:03:30
Message-ID: OS0PR01MB5716D781C642CCF759C5B50894C89@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Monday, January 23, 2023 8:51 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are my review comments for patch v4-0001
> ======
> Commit message
>
> 2.
>
> The problem is when there is a DDL in a transaction that generates lots of
> temporary data due to rewrite rules, these temporary data will not be processed
> by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts
> caused by filtering out changes in pgoutput. Therefore, the previous fix for DML
> had no impact on this case.
>
> ~
>
> IMO this still needs some rewording to say up-front what the actual problem is -- i.e.
> an avoidable timeout occurring.
>
> SUGGESTION (or something like this...)
>
> When there is a DDL in a transaction that generates lots of temporary data due
> to rewrite rules, this temporary data will not be processed by the pgoutput
> plugin. This means it is possible for a timeout to occur if a sufficiently long time
> elapses since the last pgoutput message. A previous commit (f95d53e) fixed a
> similar scenario in this area, but that only fixed timeouts for DML going through
> pgoutput, so it did not address this DDL timeout case.

Thanks, I changed the commit message as suggested.

> ======
> src/backend/replication/logical/logical.c
>
> 3. update_progress_txn_cb_wrapper
>
> +/*
> + * Update progress callback while processing a transaction.
> + *
> + * Try to update progress and send a keepalive message during sending
> +data of a
> + * transaction (and its subtransactions) to the output plugin.
> + *
> + * For a large transaction, if we don't send any change to the
> +downstream for a
> + * long time (exceeds the wal_receiver_timeout of standby) then it can timeout.
> + * This can happen when all or most of the changes are either not
> +published or
> + * got filtered out.
> + */
> +static void
> +update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN
> *txn,
> + ReorderBufferChange *change)
>
> Simplify the "Try to..." paragraph. And the other part should also mention DDL.
>
> SUGGESTION
>
> Try to send a keepalive message during transaction processing.
>
> This is done because if we don't send any change to the downstream for a long
> time (exceeds the wal_receiver_timeout of standby), then it can timeout. This can
> happen for large DDL, or for large transactions when all or most of the changes
> are either not published or got filtered out.

Changed.
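
To make the intent of this wrapper easier to follow without opening the
patch, here is a minimal sketch of its shape. The body is inferred from the
comment under discussion, not copied from v5: OutputPluginUpdateProgress,
ctx->write_location and the change's lsn field are existing names, but the
rest of the wiring may differ (for instance, the real wrapper presumably
also installs the usual error context callback like the other cb_wrappers
in logical.c do).

/*
 * Sketch only -- the exact body in the v5 patch may differ.  The wrapper is
 * invoked from the reorderbuffer while a transaction's changes are being
 * processed, and gives the output plugin a chance to update progress (and
 * thereby send a keepalive) so the standby's wal_receiver_timeout is not
 * exceeded even when every change is filtered out.
 */
static void
update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                               ReorderBufferChange *change)
{
    LogicalDecodingContext *ctx = cache->private_data;

    /* Report progress up to the LSN of the change currently being handled. */
    ctx->write_location = change->lsn;

    /* This may send a keepalive message if enough time has passed. */
    OutputPluginUpdateProgress(ctx, false);
}
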

> ======
> .../replication/logical/reorderbuffer.c
>
> 4. ReorderBufferProcessTXN
>
> @@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>
> PG_TRY();
> {
> + /*
> + * Static variable used to accumulate the number of changes while
> + * processing txn.
> + */
> + static int changes_count = 0;
> +
> + /*
> + * Sending keepalive messages after every change has some overhead, but
> + * testing showed there is no noticeable overhead if keepalive is only
> + * sent after every ~100 changes.
> + */
> +#define CHANGES_THRESHOLD 100
> +
>
> IMO these can be relocated to be declared/defined inside the "while"
> loop -- i.e. closer to where they are being used.

Moved into the while loop.
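
For reference, after the relocation the accumulation logic inside the while
loop of ReorderBufferProcessTXN looks roughly like the following. This is a
sketch assembled from the quoted hunks; the callback name update_progress_txn
and the exact call arguments are assumptions, not copied from v5.

    while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
    {
        /*
         * Static variable used to accumulate the number of changes while
         * processing txn.
         */
        static int  changes_count = 0;

        /*
         * Sending keepalive messages after every change has some overhead,
         * but testing showed there is no noticeable overhead if a keepalive
         * is only sent after every ~100 changes.
         */
#define CHANGES_THRESHOLD 100

        /*
         * After CHANGES_THRESHOLD changes, try to update progress (which may
         * send a keepalive) and reset the counter.
         */
        if (++changes_count >= CHANGES_THRESHOLD)
        {
            rb->update_progress_txn(rb, txn, change);
            changes_count = 0;
        }

        /* ... existing per-change processing continues here ... */
    }
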

Attached is the new version patch which addresses the above comments.
Also attached is a simple script which uses "refresh matview" to reproduce
this timeout problem, in case someone wants to try it.

Best regards,
Hou zj

Attachment Content-Type Size
test.sh application/octet-stream 2.0 KB
v5-0001-Fix-the-logical-replication-timeout-during-proces.patch application/octet-stream 11.8 KB
