Re: Timeout when changes are filtered out by the core during logical replication

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, wangw(dot)fnst(at)fujitsu(dot)com
Subject: Re: Timeout when changes are filtered out by the core during logical replication
Date: 2023-01-09 14:50:47
Message-ID: CAGEoWWTY6_0PfAz=HR3KrmykgMWGFdVs8A-GfCAAEEV1XFZsbw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 23, 2022 at 2:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> >
> > Hi All,
> > A customer ran a script dropping a few dozen users in a transaction.
> > Before dropping a user, they change the ownership of the tables owned by
> > that user to another user and revoke all access from that user in the
> > same transaction. There were a few thousand tables whose privileges and
> > ownership were changed by this transaction. Since all of these changes
> > were in catalog tables, they were filtered out in
> > ReorderBufferProcessTXN() by the following code:
> >     if (!RelationIsLogicallyLogged(relation))
> >         goto change_done;
> >
> > I tried to reproduce a similar situation through the attached TAP test.
> > For 500 users and 1000 tables, we see that the transaction takes
> > significant time but logical decoding does not take much time. So with
> > the default 1 min WAL sender and receiver timeout I could not reproduce
> > the timeout. Beyond that, our TAP test itself times out.
> >
> > But I think there's a possibility that the logical receiver will time
> > out this way when decoding a sufficiently large transaction which takes
> > more than the timeout amount of time to decode. So I think we need to
> > call OutputPluginUpdateProgress() after a regular interval (in terms of
> > time or number of changes) to consume any feedback from the subscriber
> > or send a keep-alive message.
> >
>
> I don't think it will be a good idea to directly call
> OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
> patch to discuss this problem [1].
>

Yeah. I don't mean to use OutputPluginUpdateProgress() directly. The patch
just showed that calling it there in some way helps. Thanks for pointing
out the other thread. I have reviewed the patch on that thread and will
continue the discussion there.
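
To make "in some way" a bit more concrete, here is a rough standalone
sketch of the "every N changes" idea. It is not PostgreSQL code and not
what the patch does: ChangeThrottle, throttle_init(),
throttle_note_change() and the progress_cb callback are hypothetical
names standing in for whatever indirection would eventually reach the
progress/keep-alive logic, and the interval is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hook that would send a keep-alive. */
typedef void (*progress_cb) (void *arg);

typedef struct ChangeThrottle
{
	size_t		interval;	/* invoke the callback after this many changes */
	size_t		since_last;	/* changes seen since the last callback */
	progress_cb cb;
	void	   *cb_arg;
} ChangeThrottle;

void
throttle_init(ChangeThrottle *t, size_t interval, progress_cb cb, void *arg)
{
	assert(interval > 0);
	t->interval = interval;
	t->since_last = 0;
	t->cb = cb;
	t->cb_arg = arg;
}

/*
 * Call once per decoded change, including changes that are about to be
 * filtered out, so that a long run of skipped changes still triggers
 * the callback periodically.
 */
void
throttle_note_change(ChangeThrottle *t)
{
	if (++t->since_last >= t->interval)
	{
		t->since_last = 0;
		t->cb(t->cb_arg);
	}
}

/* Example callback for illustration: just counts invocations. */
void
count_call(void *arg)
{
	(*(int *) arg)++;
}
```

The point is only that the counter advances before the filtering
decision, so skipped catalog changes still count towards the interval.
A time-based interval would work along the same lines.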

>
> > Following commit
> > ```
> > commit 87c1dd246af8ace926645900f02886905b889718
> > Author: Amit Kapila <akapila(at)postgresql(dot)org>
> > Date: Wed May 11 10:12:23 2022 +0530
> >
> > Fix the logical replication timeout during large transactions.
> >
> > ```
> > fixed a similar problem when the changes were filtered by an output
> > plugin, but in this case the changes are not even handed over to the
> > output plugin. If we fix it in the core, we may not need to handle it
> > in the output plugin as that commit does. The commit does not have a
> > test case which I could run to reproduce the timeout.
> >
>
> It is not evident how to write a stable test for this because
> estimating how many changes are enough for the configured
> wal_receiver_timeout to pass on all the buildfarm machines is tricky.
> If you have good ideas then feel free to propose a test patch.
>

Will continue this on the other thread.

--
Best Wishes,
Ashutosh
