Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tang, Haiying/唐 海英 <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-12-22 15:20:03
Message-ID: CAA5-nLDPCCRS71Y78nTd47VgAqY16WuhoPCPTHbXi_rTyzXOXg@mail.gmail.com
Lists: pgsql-hackers

Hello Amit,

I was able to reproduce the timeout problem in the lab.
After loading more than 20 million rows into a table which is not
replicated (the insert command ends without error), errors related to the
logical replication processes appear in the postgres log.
The worker process is restarted approximately every 5 minutes. The snap
files in the slot directory are still present, and the replication system
seems to be blocked. Why are these snap files not removed? What do they
contain?
I will recompile postgres with your patch to debug.
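
For reference, the load was of the following form (a minimal sketch; the
table and column definitions are placeholders, the point being that the
table is not part of the publication):

CREATE TABLE foo (id int, payload text);
-- the table is not added to the publication, so its changes are decoded
-- by the walsender but never sent to the subscriber
INSERT INTO foo SELECT i, md5(i::text) FROM generate_series(1, 20000000) i;

If I understand correctly, the .snap files shown further down would then be
the reorder buffer spilling the decoded changes of this large transaction
to disk.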

2021-12-22 14:54:21.506 CET [64939] STATEMENT: START_REPLICATION SLOT
"sub008_s000000" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s000000"')
2021-12-22 15:01:20.908 CET [64938] ERROR: terminating logical replication
worker due to timeout
2021-12-22 15:01:20.911 CET [61827] LOG: worker process: logical
replication worker for subscription 26994 (PID 64938) exited with exit code
1
2021-12-22 15:01:20.923 CET [65037] LOG: logical replication apply worker
for subscription "sub008_s000000" has started
2021-12-22 15:01:20.932 CET [65038] ERROR: replication slot
"sub008_s000000" is active for PID 64939
2021-12-22 15:01:20.932 CET [65038] STATEMENT: START_REPLICATION SLOT
"sub008_s000000" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s000000"')
2021-12-22 15:01:20.932 CET [65037] ERROR: could not start WAL streaming:
ERROR: replication slot "sub008_s000000" is active for PID 64939
2021-12-22 15:01:20.933 CET [61827] LOG: worker process: logical
replication worker for subscription 26994 (PID 65037) exited with exit code
1
2021-12-22 15:01:25.944 CET [65039] LOG: logical replication apply worker
for subscription "sub008_s000000" has started
2021-12-22 15:01:25.951 CET [65040] ERROR: replication slot
"sub008_s000000" is active for PID 64939
2021-12-22 15:01:25.951 CET [65040] STATEMENT: START_REPLICATION SLOT
"sub008_s000000" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s000000"')
2021-12-22 15:01:25.951 CET [65039] ERROR: could not start WAL streaming:
ERROR: replication slot "sub008_s000000" is active for PID 64939
2021-12-22 15:01:25.952 CET [61827] LOG: worker process: logical
replication worker for subscription 26994 (PID 65039) exited with exit code
1
2021-12-22 15:01:30.962 CET [65041] LOG: logical replication apply worker
for subscription "sub008_s000000" has started
2021-12-22 15:01:30.970 CET [65042] ERROR: replication slot
"sub008_s000000" is active for PID 64939
2021-12-22 15:01:30.970 CET [65042] STATEMENT: START_REPLICATION SLOT
"sub008_s000000" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s000000"')
2021-12-22 15:01:30.970 CET [65041] ERROR: could not start WAL streaming:
ERROR: replication slot "sub008_s000000" is active for PID 64939
2021-12-22 15:01:30.971 CET [61827] LOG: worker process: logical
replication worker for subscription 26994 (PID 65041) exited with exit code
1
2021-12-22 15:01:35.982 CET [65043] LOG: logical replication apply worker
for subscription "sub008_s000000" has started
2021-12-22 15:01:35.990 CET [65044] ERROR: replication slot
"sub008_s000000" is active for PID 64939
2021-12-22 15:01:35.990 CET [65044] STATEMENT: START_REPLICATION SLOT
"sub008_s000000" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s000000"')

-rw-------. 1 postgres postgres 16270723 Dec 22 16:02
xid-14312-lsn-23-99000000.snap
-rw-------. 1 postgres postgres 16145717 Dec 22 16:02
xid-14312-lsn-23-9A000000.snap
-rw-------. 1 postgres postgres 10889437 Dec 22 16:02
xid-14312-lsn-23-9B000000.snap
[postgres(at)s729058a debug]$ ls -ltr pg_replslot/sub008_s012a00/ | wc -l
1420
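
To see which process is pinning the slot while the apply worker loops, the
pg_replication_slots view can be queried on the publisher (the slot name is
taken from the log above):

SELECT slot_name, active, active_pid, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub008_s000000';

-- and on the subscriber, the timeout that terminates the apply worker:
SHOW wal_receiver_timeout;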

On Fri, Nov 12, 2021 at 5:57 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
wrote:

> I made a mistake in the configuration of my test script; in fact, I cannot
> reproduce the problem at the moment.
> Yes, in the original environment there is physical replication; that is why
> I configured 2 nodes with physical replication for the lab.
> I'll try new tests next week.
> Regards
>
> On Fri, Nov 12, 2021 at 7:23 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> On Thu, Nov 11, 2021 at 11:15 PM Fabrice Chapuis
>> <fabrice636861(at)gmail(dot)com> wrote:
>> >
>> > Hello,
>> > Our lab is ready now. Amit, I compiled Postgres 10.18 with your
>> patch. Tang, I used your script to configure logical replication between 2
>> databases and to generate 10 million entries in an unreplicated foo table.
>> On a standalone instance no error message appears in the log.
>> > I activated physical replication between the 2 nodes, and I got the
>> following error:
>> >
>> > 2021-11-10 10:49:12.297 CET [12126] LOG: attempt to send keep alive
>> message
>> > 2021-11-10 10:49:12.297 CET [12126] STATEMENT: START_REPLICATION
>> 0/3000000 TIMELINE 1
>> > 2021-11-10 10:49:15.127 CET [12064] FATAL: terminating logical
>> replication worker due to administrator command
>> > 2021-11-10 10:49:15.127 CET [12036] LOG: worker process: logical
>> replication worker for subscription 16413 (PID 12064) exited with exit code
>> 1
>> > 2021-11-10 10:49:15.155 CET [12126] LOG: attempt to send keep alive
>> message
>> >
>> > This message looks strange because no admin command has been
>> executed during the data load.
>> > I did not find any error related to the timeout.
>> > The message coming from the modification made with the patch appears
>> all the time: "attempt to send keep alive message". But there is no "sent
>> keep alive message".
>> >
>> > Why does the logical replication worker exit when physical replication
>> is configured?
>> >
>>
>> I am also not sure why that happened; it may be due to
>> max_worker_processes reaching its limit. This can happen because it
>> seems you configured both the publisher and the subscriber in the same
>> cluster. Tang, did you also see the same problem?
>>
>> BTW, why are you bringing a physical standby configuration into the
>> test? Were the physical standbys there in your original setup where
>> you observed the problem?
>>
>> --
>> With Regards,
>> Amit Kapila.
>>
>
