RE: Logical replication timeout problem

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: RE: Logical replication timeout problem
Date: 2022-10-19 08:15:26
Message-ID: OS3PR01MB6275478E5D29E4A563302D3D9E2B9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 18, 2022 at 22:35 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com> wrote:
> Hello Amit,
>
> In version 14.4 the timeout problem for logical replication happens again despite
> the patch provided for this issue in this version. When bulky materialized views
> are reloaded it broke logical replication. It is possible to solve this problem by
> using your new "streaming" option.
> Have you ever had this issue reported to you?
>
> Regards
>
> Fabrice
>
> 2022-10-10 17:19:02 CEST [538424]: [17-1]
> user=postgres,db=dbxxxa00,client=[local] CONTEXT: SQL statement "REFRESH
> MATERIALIZED VIEW sxxxa00.table_base"
> PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
> 2022-10-10 17:19:02 CEST [538424]: [18-1]
> user=postgres,db=dbxxxa00,client=[local] STATEMENT: select
> refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:02 CEST [538424]: [19-1]
> user=postgres,db=dbxxxa00,client=[local] LOG: duration: 264815.652
> ms statement: select refresh_materialized_view('sxxxa00.table_base');
> 2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG: automatic
> vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
> pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped frozen
> tuples: 0 removed, 48472622 remain, 0 are dead but not yet removable,
> oldest xmin: 1501528
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item
> identifiers removed
> I/O timings: read: 1.494 ms, write: 0.000 ms
> avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
> buffer usage: 593301 hits, 77 misses, 294605 dirtied
> WAL usage: 296644 records, 46119 full page images, 173652718 bytes
> system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
> 2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG: automatic
> analyze of table "dbxxxa00.sxxxa00.table_base"
> I/O timings: read: 0.043 ms, write: 0.000 ms
> avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
> buffer usage: 30308 hits, 2 misses, 2 dirtied
> system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
> 2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG: checkpoint
> complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=269.551 s, sync=0.002 s, total=269.560 s; sync files=251, longest=0.00
> 1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
> 2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR: terminating
> logical replication worker due to timeout
> 2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG: background
> worker "logical replication worker" (PID 716163) exited with exit code 1
> 2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG: logical
> replication apply worker for subscription "subxxx_sxxxa00" has started

Thanks for reporting!

There is one thing I want to confirm:
Is the statement `select refresh_materialized_view('sxxxa00.table_base');`
executed on the publisher-side?

If so, I think the reason for this timeout problem could be that during DDL
(`REFRESH MATERIALIZED VIEW`), lots of temporary data is generated due to
rewrite. Since these temporary data will not be processed by the pgoutput
plugin, our previous fix for DML had no impact on this case.
I think setting "streaming" option to "on" could work around this problem.

I tried to write a draft patch (see attachment) on REL_14_4 to fix this.
I tried it locally and it seems to work.
Could you please confirm whether this problem is fixed after applying this
draft patch?

If this draft patch works, I will improve it and try to fix this problem.

Regards,
Wang wei

Attachment Content-Type Size
0001-draft-for-REL_14_4.patch application/octet-stream 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2022-10-19 08:17:17 Re: make_ctags: use -I option to ignore pg_node_attr macro
Previous Message Richard Guo 2022-10-19 08:02:34 Some comments that should've covered MERGE