RE: speed up a logical replica setup

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Euler Taveira' <euler(at)eulerto(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
Subject: RE: speed up a logical replica setup
Date: 2024-02-09 12:03:15
Message-ID: TYCPR01MB1207713BEC5C379A05D65E342F54B2@TYCPR01MB12077.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Euler,

Further comments for v17.

01.
This program assumes that the target server has same major version with this.
Because the target server would be restarted by same version's pg_ctl command.
I felt it should be ensured by reading the PG_VERSION.

02.
pg_upgrade checked the version of using executables, like pg_ctl, postgres, and
pg_resetwal. I felt it should be as well.

03. get_bin_directory
```
if (find_my_exec(path, full_path) < 0)
{
pg_log_error("The program \"%s\" is needed by %s but was not found in the\n"
"same directory as \"%s\".\n",
"pg_ctl", progname, full_path);
```

s/"pg_ctl"/progname

04.
Missing canonicalize_path()?

05.
Assuming that the target server is a cascade standby, i.e., it has a role as
another primary. In this case, I thought the child node would not work. Because
pg_createsubcriber runs pg_resetwal and all WAL files would be discarded at that
time. I have not tested, but should the program detect it and exit earlier?

06.
wait_for_end_recovery() waits forever even if the standby has been disconnected
from the primary, right? should we check the status of the replication via
pg_stat_wal_receiver?

07.
The cleanup function has couple of bugs.

* If subscriptions have been created on the database, the function also tries to
drop a publication. But it leads an ERROR because it has been already dropped.
See setup_subscriber().
* If the subscription has been created, drop_replication_slot() leads an ERROR.
Because the subscriber tried to drop the subscription while executing DROP SUBSCRIPTION.

08.
I found that all messages (ERROR, WARNING, INFO, etc...) would output to stderr,
but I felt it should be on stdout. Is there a reason? pg_dump outputs messages to
stderr, but the motivation might be to avoid confusion with dumps.

09.
I'm not sure the cleanup for subscriber is really needed. Assuming that there
are two databases, e.g., pg1 pg2 , and we fail to create a subscription on pg2.
This can happen when the subscription which has the same name has been already
created on the primary server.
In this case a subscirption pn pg1 would be removed. But what is a next step?
Since a timelineID on the standby server is larger than the primary (note that
the standby has been promoted once), we cannot resume the physical replication
as-is. IIUC the easiest method to retry is removing a cluster once and restarting
from pg_basebackup. If so, no need to cleanup the standby because it is corrupted.
We just say "Please remove the cluster and recreate again".

Here is a reproducer.

1. apply the txt patch atop 0001 patch.
2. run test_corruption.sh.
3. when you find a below output [1], connect to a testdb from another terminal and
run CREATE SUBSCRITPION for the same subscription on the primary
4. Finally, pg_createsubscriber would fail the creation.

I also attached server logs of both nodes and the output.
Note again that this is a real issue. I used a tricky way for surely overlapping name,
but this can happen randomly.

10.
While investigating #09, I found that we cannot report properly a reason why the
subscription cannot be created. The output said:

```
pg_createsubscriber: error: could not create subscription "pg_createsubscriber_16389_3884" on database "testdb": out of memory
```

But the standby serverlog said:

```
ERROR: subscription "pg_createsubscriber_16389_3884" already exists
STATEMENT: CREATE SUBSCRIPTION pg_createsubscriber_16389_3884 CONNECTION 'user=postgres port=5431 dbname=testdb' PUBLICATION pg_createsubscriber_16389 WITH (create_slot = false, copy_data = false, enabled = false)
```

[1]
```
pg_createsubscriber: creating the replication slot "pg_createsubscriber_16389_3884" on database "testdb"
pg_createsubscriber: XXX: sleep 20s
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

Attachment Content-Type Size
N1.log application/octet-stream 3.5 KB
result.dat application/octet-stream 3.4 KB
server_start_20240209T115112.963.log application/octet-stream 2.9 KB
test_corruption.sh application/octet-stream 1.4 KB
add_sleep.txt text/plain 655 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Cramer 2024-02-09 12:18:13 Re: [PATCH] Add native windows on arm64 support
Previous Message Shubham Khanna 2024-02-09 11:48:35 Re: speed up a logical replica setup