Quick Links

Re: Logical replication timeout problem

From:	Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Euler Taveira <euler(at)eulerto(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject:	Re: Logical replication timeout problem
Date:	2022-10-18 14:35:02
Message-ID:	CAA5-nLC9HdJPgCSOcVCZ7YMmhv3nLgGbReMB5jyEGbXA6VVsyg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello Amit,

In version 14.4 the timeout problem for logical replication happens again
despite the patch provided for this issue in this version. When bulky
materialized views are reloaded it broke logical replication. It is
possible to solve this problem by using your new "streaming" option.
Have you ever had this issue reported to you?

Regards

Fabrice

2022-10-10 17:19:02 CEST [538424]: [17-1]
user=postgres,db=dbxxxa00,client=[local] CONTEXT: SQL statement "REFRESH
MATERIALIZED VIEW sxxxa00.table_base"
PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
2022-10-10 17:19:02 CEST [538424]: [18-1]
user=postgres,db=dbxxxa00,client=[local] STATEMENT: select
refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:02 CEST [538424]: [19-1]
user=postgres,db=dbxxxa00,client=[local] LOG: duration: 264815.652 ms
statement: select refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG: automatic
vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped
frozen
tuples: 0 removed, 48472622 remain, 0 are dead but not yet
removable, oldest xmin: 1501528
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 1.494 ms, write: 0.000 ms
avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
buffer usage: 593301 hits, 77 misses, 294605 dirtied
WAL usage: 296644 records, 46119 full page images, 173652718 bytes
system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG: automatic
analyze of table "dbxxxa00.sxxxa00.table_base"
I/O timings: read: 0.043 ms, write: 0.000 ms
avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
buffer usage: 30308 hits, 2 misses, 2 dirtied
system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
checkpoint complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0
removed, 0 recycled; write=269.551 s, sync=0.002 s, total=269.560 s; sync
files=251, longest=0.00
1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout
2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
background worker "logical replication worker" (PID 716163) exited with
exit code 1
2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG: logical
replication apply worker for subscription "subxxx_sxxxa00" has started

On Fri, Apr 1, 2022 at 6:09 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira <euler(at)eulerto(dot)com> wrote:
> >
> > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
> >
> > This is exactly our initial analysis and we have tried a patch on
> > these lines and it has a noticeable overhead. See [1]. Calling this
> > for each change or each skipped change can bring noticeable overhead
> > that is why we decided to call it after a certain threshold (100) of
> > skipped changes. Now, surely as mentioned in my previous reply we can
> > make it generic such that instead of calling this (update_progress
> > function as in the patch) for skipped cases, we call it always. Will
> > that make it better?
> >
> > That's what I have in mind but using a different approach.
> >
> > > The functions CreateInitDecodingContext and CreateDecodingContext
> receives the
> > > update_progress function as a parameter. These functions are called in
> 2
> > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT)
> and (b)
> > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > > WalSndUpdateProgress as a progress function. Case (b) does not have
> one because
> > > it is not required -- local decoding/communication. There is no custom
> update
> > > progress routine for each output plugin which leads me to the question:
> > > couldn't we encapsulate the update progress call into the callback
> functions?
> > >
> >
> > Sorry, I don't get your point. What exactly do you mean by this?
> > AFAIS, currently we call this output plugin API in pgoutput functions
> > only, do you intend to get it invoked from a different place?
> >
> > It seems I didn't make myself clear. The callbacks I'm referring to the
> > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> > *_cb_wrapper() function, we have something like:
> >
> > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > NewUpdateProgress(ctx, false);
> >
> > The NewUpdateProgress function would contain a logic similar to the
> > update_progress() from the proposed patch. (A different function name
> here just
> > to avoid confusion.)
> >
> > The output plugin is responsible to set ctx->progress with the callback
> > variables (for example, PGOUTPUT_PROGRESS_CHANGE for change_cb()) that
> we would
> > like to run NewUpdateProgress.
> >
>
> This sounds like a conflicting approach to what we currently do.
> Currently, OutputPluginUpdateProgress() is called from the xact
> related pgoutput functions like pgoutput_commit_txn(),
> pgoutput_prepare_txn(), pgoutput_commit_prepared_txn(), etc. So, if we
> follow what you are saying then for some of the APIs like
> pgoutput_change/_message/_truncate, we need to set the parameter to
> invoke NewUpdateProgress() which will internally call
> OutputPluginUpdateProgress(), and for the remaining APIs, we will call
> in the corresponding pgoutput_* function. I feel if we want to make it
> more generic than the current patch, it is better to directly call
> what you are referring to here as NewUpdateProgress() in all remaining
> APIs like pgoutput_change/_truncate, etc.
>
> --
> With Regards,
> Amit Kapila.
>

In response to

Re: Logical replication timeout problem at 2022-04-01 04:08:52 from Amit Kapila

Responses

RE: Logical replication timeout problem at 2022-10-19 08:15:26 from wangw.fnst@fujitsu.com

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2022-10-18 14:46:46	Re: havingQual vs hasHavingQual buglets
Previous Message	Bharath Rupireddy	2022-10-18 13:24:22	Re: thinko in basic_archive.c