Re: speed up a logical replica setup

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-01-10 04:33:51
Message-ID: CANhcyEUCt-g4JLQU3Q3ofFk_Vt-Tqh3ZdXoLcpT8fjz9LY_-ww@mail.gmail.com
Lists: pgsql-hackers

On Fri, 5 Jan 2024 at 12:19, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> On Thu, 4 Jan 2024 at 16:46, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > > I was testing the patch with following test cases:
> > >
> > > Test 1 :
> > > - Create a 'primary' node
> > > - Set up a physical replica using pg_basebackup "./pg_basebackup -h
> > > localhost -X stream -v -R -W -D ../standby"
> > > - Insert data before and after pg_basebackup
> > > - Run pg_subscriber and then insert some data to check logical
> > > replication "./pg_subscriber -D ../standby -S "host=localhost
> > > port=9000 dbname=postgres" -P "host=localhost port=9000
> > > dbname=postgres" -d postgres"
> > > - Also check pg_publication, pg_subscription and pg_replication_slots.
> > >
> > > Observation:
> > > Data is not lost. Replication is happening correctly. Pg_subscriber is
> > > working as expected.
> > >
> > > Test 2:
> > > - Create a 'primary' node
> > > - Use normal pg_basebackup but don’t set up physical replication
> > > "./pg_basebackup -h localhost -v -W -D ../standby"
> > > - Insert data before and after pg_basebackup
> > > - Run pg_subscriber
> > >
> > > Observation:
> > > Pg_subscriber command is not completing and is stuck with following
> > > log repeating:
> > > LOG: waiting for WAL to become available at 0/3000168
> > > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> > >
> >
> > I think probably the required WAL is not copied. Can you use the -X
> > option to stream WAL as well and then test? But I feel in this case
> > also, we should wait for some threshold time and then exit with
> > failure, removing new objects created, if any.
>
> I have tested with -X stream option in pg_basebackup as well. In this
> case also the pg_subscriber command is getting stuck.
> logs:
> 2024-01-05 11:49:34.436 IST [61948] LOG: invalid resource manager ID
> 102 at 0/3000118
> 2024-01-05 11:49:34.436 IST [61948] LOG: waiting for WAL to become
> available at 0/3000130
>
> >
> > > Test 3:
> > > - Create a 'primary' node
> > > - Use normal pg_basebackup but don’t set up physical replication
> > > "./pg_basebackup -h localhost -v -W -D ../standby"
> > > - Insert data before pg_basebackup but not after pg_basebackup
> > > - Run pg_subscriber
> > >
> > > Observation:
> > > Pg_subscriber command is not completing and is stuck with following
> > > log repeating:
> > > LOG: waiting for WAL to become available at 0/3000168
> > > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> > >
> >
> > This is similar to the previous test and you can try the same option
> > here as well.
> For this test as well, I tried the -X stream option in pg_basebackup.
> It gets stuck here as well, with a similar log.
>
> Will investigate the issue further.

I noticed that pg_subscriber gets stuck when it is run on a node that
is not a standby. This is because of the following code:
+ conn = connect_database(dbinfo[0].pubconninfo);
+ if (conn == NULL)
+ exit(1);
+ consistent_lsn = create_logical_replication_slot(conn, &dbinfo[0],
+ temp_replslot);
+
.....
+else
+ {
+ appendPQExpBuffer(recoveryconfcontents, "recovery_target_lsn = '%s'\n",
+ consistent_lsn);
+ WriteRecoveryConfig(conn, subscriber_dir, recoveryconfcontents);
+ }

Here the standby node waits during recovery for the WAL at
'consistent_lsn', but this WAL will not be present on the node if no
physical replication is set up. Hence the command waits indefinitely
for the WAL.
To solve this, I added a timeout of 60s for the recovery process and
also added a check so that pg_subscriber gives an error when it is
called for a node which is not in physical replication.
I have attached a patch for the same. It is a top-up patch on the
patch shared by Euler at [1].
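
As a rough illustration of the two safeguards (this is not the code in
the attached patch), the logic could look like the sketch below. All
names here (RecoveryProbe, FakeServer, is_standby,
wait_for_end_of_recovery) are hypothetical. It assumes the tool learns
the node's state from "SELECT pg_is_in_recovery()", whose boolean result
libpq delivers as the text "t" or "f"; the status query and the sleep
are injected as callbacks so the logic can be tried without a server.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical probe. A real tool would run "SELECT pg_is_in_recovery()"
 * over libpq in in_recovery() and sleep for one second in
 * sleep_one_second(); both are injected here (an assumption of this sketch).
 */
typedef struct RecoveryProbe
{
	bool		(*in_recovery) (void *ctx);
	void		(*sleep_one_second) (void *ctx);
	void	   *ctx;
} RecoveryProbe;

/* Interpret the text form of pg_is_in_recovery(): "t" means standby. */
static bool
is_standby(const char *pg_is_in_recovery_text)
{
	return pg_is_in_recovery_text != NULL &&
		strcmp(pg_is_in_recovery_text, "t") == 0;
}

/*
 * Poll once per second until recovery ends or timeout_secs have elapsed.
 * Returns true if recovery finished in time, false on timeout.
 */
static bool
wait_for_end_of_recovery(const RecoveryProbe *probe, int timeout_secs)
{
	assert(probe != NULL && timeout_secs >= 0);

	for (int waited = 0; probe->in_recovery(probe->ctx); waited++)
	{
		if (waited >= timeout_secs)
			return false;		/* give up instead of waiting forever */
		probe->sleep_one_second(probe->ctx);
	}
	return true;
}

/* Stand-in server for experiments: stays in recovery for a fixed number of polls. */
typedef struct FakeServer
{
	int			polls_until_promoted;
	int			sleeps;
} FakeServer;

static bool
fake_in_recovery(void *ctx)
{
	FakeServer *server = ctx;

	if (server->polls_until_promoted > 0)
	{
		server->polls_until_promoted--;
		return true;
	}
	return false;
}

static void
fake_sleep(void *ctx)
{
	FakeServer *server = ctx;

	server->sleeps++;			/* count instead of really sleeping */
}
```

With a 60s timeout, a node that never receives the WAL at
'consistent_lsn' makes wait_for_end_of_recovery() return false after 60
polls, at which point the caller can report an error and clean up any
objects it created, rather than hanging.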

Please review the changes and merge them if they look ok.

[1] - https://www.postgresql.org/message-id/e02a2c17-22e5-4ba6-b788-de696ab74f1e%40app.fastmail.com

Thanks and regards
Shlok Kyal

Attachment: v1-0001-Restrict-pg_subscriber-to-standby-node.patch (application/octet-stream, 3.8 KB)
