Re: Logical replication failed with SSL SYSCALL error

From: shaurya jain <12345shaurya(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication failed with SSL SYSCALL error
Date: 2023-04-24 02:59:44
Message-ID: CAHHJ3NTgRi70cwAWFULzAc+fsPeBf_=O_VAMO8tH5FbLXFFjag@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Hi Vignesh,

That's really prompt and solves our problem. Thank you buddy.

Please go through my inline comments:-

On Thu, Apr 20, 2023 at 11:49 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:

> On Wed, 19 Apr 2023 at 17:26, shaurya jain <12345shaurya(at)gmail(dot)com> wrote:
> >
> > Hi Team,
> >
> > Could you please help me with this, It's urgent for the production
> environment.
> >
> > On Wed, Apr 19, 2023 at 3:44 PM shaurya jain <12345shaurya(at)gmail(dot)com>
> wrote:
> >>
> >> Hi Team,
> >>
> >> Could you please help, It's urgent for the production env?
> >>
> >> On Sun, Apr 16, 2023 at 2:40 AM shaurya jain <12345shaurya(at)gmail(dot)com>
> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> Postgres Version:- 13.8
> >>> Issue:- Logical replication failing with SSL SYSCALL error
> >>> Priority:-High
> >>>
> >>> We are migrating our database through logical replication, and all of
> a sudden the errors below appeared in the source and target logs, which
> gives us no lead on the cause.
> >>>
> >>> Logs from Source:-
> >>> LOG: could not send data to client: Connection reset by peer
> >>> STATEMENT: COPY public.test TO STDOUT
> >>> FATAL: connection to client lost
> >>> STATEMENT: COPY public.test TO STDOUT
> >>>
> >>> Logs from Target:-
> >>> 2023-04-15 19:07:02 UTC::@:[1250]:ERROR: could not receive data from
> WAL stream: SSL SYSCALL error: Connection timed out
> >>> 2023-04-15 19:07:02 UTC::@:[1250]:CONTEXT: COPY test, line 365326932
> >>> 2023-04-15 19:07:03 UTC::@:[505]:LOG: background worker "logical
> replication worker" (PID 1250) exited with exit code 1
> >>> 2023-04-15 19:07:03 UTC::@:[7155]:LOG: logical replication table
> synchronization worker for subscription " sub_tables_2_180", table "test"
> has started
> >>> 2023-04-15 19:12:05 UTC:10.144.19.34(33276):postgres(at)webadmit_staging:[7112]:WARNING:
> there is no transaction in progress
> >>> 2023-04-15 19:14:08 UTC:10.144.19.34(33324):postgres(at)webadmit_staging:[6052]:LOG:
> could not receive data from client: Connection reset by peer
> >>> 2023-04-15 19:17:23 UTC::@:[2112]:ERROR: could not receive data from
> WAL stream: SSL SYSCALL error: Connection timed out
> >>> 2023-04-15 19:17:23 UTC::@:[1089]:ERROR: could not receive data from
> WAL stream: SSL SYSCALL error: Connection timed out
> >>> 2023-04-15 19:17:23 UTC::@:[2556]:ERROR: could not receive data from
> WAL stream: SSL SYSCALL error: Connection timed out
> >>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical
> replication worker" (PID 2556) exited with exit code 1
> >>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical
> replication worker" (PID 2112) exited with exit code 1
> >>> 2023-04-15 19:17:23 UTC::@:[505]:LOG: background worker "logical
> replication worker" (PID 1089) exited with exit code 1
> >>> 2023-04-15 19:17:23 UTC::@:[7287]:LOG: logical replication apply
> worker for subscription "sub_tables_2_180" has started
> >>> 2023-04-15 19:17:23 UTC::@:[7288]:LOG: logical replication apply
> worker for subscription "sub_tables_3_192" has started
> >>> 2023-04-15 19:17:23 UTC::@:[7289]:LOG: logical replication apply
> worker for subscription "sub_tables_1_180" has started
> >>>
> >>> Just after this error, all other replication slots are disabled for
> some time and then come back online, along with a COPY command under a new
> PID in pg_stat_activity.
> >>>
> >>> I have a few queries regarding this:-
> >>>
> >>> What is the exact reason for the disconnection? (Some articles blame
> memory, others the network.)
> This might be caused by a network failure. Did you notice any network
> instability? Could you check the TCP settings, in particular
> tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count?
> A keepalive probe is sent after tcp_keepalives_idle seconds of
> inactivity; if the peer does not respond within tcp_keepalives_interval
> seconds the probe is retried, and the connection is considered dead
> after tcp_keepalives_count failed probes.

Yes, you were correct: the issue was network-related. The VPN tunnel was
restarted because of a misconfiguration on the tunnel side. Since fixing
that, the problem has not recurred. The parameters were set to the
following values:

1. keepalives_idle 60
2. keepalives_interval 100
3. keepalives_count 60
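
As a rough sanity check on the values listed above, here is a minimal sketch of how long a silently dead connection could go unnoticed. It assumes Linux-style TCP keepalive semantics (first probe after the idle period, then one probe per interval, connection dropped after the given number of unanswered probes). The function name is hypothetical and the result is an approximation, not something PostgreSQL reports.

```python
def dead_peer_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate worst-case time before a silent peer is declared dead.

    Assumption: Linux-style keepalive behaviour, where the first probe is
    sent after `idle` seconds of inactivity, further probes follow every
    `interval` seconds, and the connection is dropped once `count` probes
    have gone unanswered.
    """
    return idle + interval * count


# Values quoted in this message.
seconds = dead_peer_detection_seconds(idle=60, interval=100, count=60)
print(f"{seconds} s (~{seconds / 60:.0f} min)")  # prints "6060 s (~101 min)"
```

Under that assumption, these settings would let a broken tunnel go undetected for well over an hour, so smaller interval and count values may surface a failure sooner.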

> >>> Will it lead to data inconsistency?
> It will not lead to inconsistency. In case of failure the failed
> transaction will be rolled back.

Yes, the migration completed as expected after we fixed the network.
>
> >>> Does the COPY command under the new PID migrate the whole data of the
> test table again?
> Yes, it will migrate the whole table data again in case of failures.

Yes, I follow you on that. Is there any way to do an rsync-style transfer
instead of a simple copy?
>
> Regards,
> Vignesh
>

--
Thanks and Regards,
Shaurya Jain
