Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tang, Haiying/唐 海英 <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2022-01-11 14:43:29
Message-ID: CAA5-nLCUfqGmPCjK_NnKuJuud98rB3YNQYefnwfvq+JkxYY4Ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Can you explain why you think this will help in solving your current
problem?

Indeed your are right this function won't help, we have to look elsewhere.

It is still not clear to me why the problem happened? IIUC, after restoring
4096 changes from snap files, we send them to the subscriber, and then
apply worker should apply those one by one. Now, is it taking one minute to
restore 4096 changes due to which apply worker is timed out?

Now I can easily reproduce the problem.
In a first phase, snap files are generated and stored in pg_replslot. This
process end when1420 files are present in pg_replslots (this is in relation
with statements that must be replayed from WAL). In the pg_stat_replication
view, the state field is set to *catchup*.
In a 2nd phase, the snap files must be decoded. However after one minute
(wal_receiver_timeout parameter set to 1 minute) the worker process stop
with a timeout.

I can put a debug point to check if a timeout is sent to the worker
process. Do you have any other clue?

Thank you for your help

Fabrice

On Fri, Jan 7, 2022 at 11:26 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Wed, Dec 29, 2021 at 5:02 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
> wrote:
>
>> I put the instance with high level debug mode.
>> I try to do some log interpretation: After having finished writing the
>> modifications generated by the insert in the snap files,
>> then these files are read (restored). One minute after this work starts,
>> the worker process exit with an error code = 1.
>> I see that keepalive messages were sent before the work process work
>> leave.
>>
>> 2021-12-28 10:50:01.894 CET [55792] LOCATION: WalSndKeepalive,
>> walsender.c:3365
>> ...
>> 2021-12-28 10:50:31.854 CET [55792] STATEMENT: START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: StartTransaction(1)
>> name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION: ShowTransactionStateRec,
>> xact.c:5075
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: spill 2271 changes in
>> XID 14312 to disk
>> 2021-12-28 10:50:31.907 CET [55792] LOCATION: ReorderBufferSerializeTXN,
>> reorderbuffer.c:2245
>> 2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> *2021-12-28 10:50:32.110 CET [55792] DEBUG: 00000: restored
>> 4096/22603999 changes from disk*
>> 2021-12-28 10:50:32.110 CET [55792] LOCATION: ReorderBufferIterTXNNext,
>> reorderbuffer.c:1156
>> 2021-12-28 10:50:32.110 CET [55792] STATEMENT: START_REPLICATION SLOT
>> "sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
>> '"pub008_s012a00"')
>> 2021-12-28 10:50:32.138 CET [55792] DEBUG: 00000: restored 4096/22603999
>> changes from disk
>> ...
>>
>> *2021-12-28 10:50:35.341 CET [55794] DEBUG: 00000: sending replication
>> keepalive2021-12-28 10:50:35.341 CET [55794] LOCATION: WalSndKeepalive,
>> walsender.c:3365*
>> ...
>> *2021-12-28 10:51:31.995 CET [55791] ERROR: XX000: terminating logical
>> replication worker due to timeout*
>>
>> *2021-12-28 10:51:31.995 CET [55791] LOCATION: LogicalRepApplyLoop,
>> worker.c:1267*
>>
>>
> It is still not clear to me why the problem happened? IIUC, after
> restoring 4096 changes from snap files, we send them to the subscriber, and
> then apply worker should apply those one by one. Now, is it taking one
> minute to restore 4096 changes due to which apply worker is timed out?
>
> Could this function in* Apply main loop* in worker.c help to find a
>> solution?
>>
>> rc = WaitLatchOrSocket(MyLatch,
>> WL_SOCKET_READABLE | WL_LATCH_SET |
>> WL_TIMEOUT | WL_POSTMASTER_DEATH,
>> fd, wait_time,
>> WAIT_EVENT_LOGICAL_APPLY_MAIN);
>>
>
> Can you explain why you think this will help in solving your current
> problem?
>
> --
> With Regards,
> Amit Kapila.
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Allie Crawford 2022-01-11 14:47:20 Re: [Ext:] Re: Stream Replication not working
Previous Message Fujii Masao 2022-01-11 14:28:43 Re: enhance pg_log_backend_memory_contexts() to log memory contexts of auxiliary processes