Re: Synchronizing slots from primary to standby

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-07-24 02:32:56
Message-ID: CALj2ACV+VX9McnogGNyFCjZW+qnPvdmjnBjttotygs8+7D5JuA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> Thanks Bharat for letting us know. It is okay to split the patch, it
> may definitely help to understand the modules better but shall we take
> a step back and try to reevaluate the design first before moving to
> other tasks?

Agree that design comes first. FWIW, I'm attaching the v9 patch set
that I have with me. It can't be a perfect patch set unless the design
is finalized.

> I analyzed more on the issues stated in [1] for replacing LIST_SLOTS
> with SELECT query. On rethinking, it might not be a good idea to
> replace this cmd with SELECT in Launcher code-path

I think there are fundamental design questions still open that we
should settle before optimizing LIST_SLOTS; see below. I'm sure we can
come back to this later.

> Secondly, I was thinking if the design proposed in the patch is the
> best one. No doubt, it is the most simplistic design and thus may
> .......... Any feedback is appreciated.

Here are my thoughts about this feature:

Current design:

1. On the primary, never allow walsenders associated with logical
replication slots to get ahead of physical standbys that are candidates
for becoming the primary after failover. This enables subscribers to
connect to the new primary after failover.
2. On all candidate standbys, periodically sync logical slots from the
primary (creating the slots if necessary) with one slot sync worker
per logical slot.
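
To make point 1 concrete, here is a minimal sketch of the gating rule.
It is illustrative only and not taken from the v9 patches: the function
names are hypothetical, and LSNs are modeled as plain integers.

```python
# Illustrative sketch only; names are hypothetical, not from the patch set.
# LSNs are modeled as plain integers (byte positions in the WAL stream).

def logical_send_limit(primary_flush_lsn, candidate_standby_flush_lsns):
    """Return the highest LSN a logical walsender may send up to.

    The walsender must not get ahead of any candidate standby, so the
    limit is the minimum flush LSN reported by those standbys, capped by
    what the primary itself has flushed.
    """
    if not candidate_standby_flush_lsns:
        # Assumption: with no candidate standbys there is nothing to wait for.
        return primary_flush_lsn
    return min(primary_flush_lsn, min(candidate_standby_flush_lsns.values()))


def can_send(record_end_lsn, primary_flush_lsn, candidate_standby_flush_lsns):
    """True if a decoded change ending at record_end_lsn may be sent now."""
    limit = logical_send_limit(primary_flush_lsn, candidate_standby_flush_lsns)
    return record_end_lsn <= limit
```

With the primary at 1000 and candidate standbys at 900 and 700, the
limit is 700, so the slowest candidate standby bounds what every
logical subscriber can receive.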

Important considerations:

1. Does this design guarantee that the row versions required by
subscribers aren't removed on candidate standbys? This was raised here:
https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de

It seems safe with the logical decoding on standbys feature. Also, a
test case from upthread is already in the patch sets (in v9 too):
https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com
However, we need to verify the use cases extensively.

2. All candidate standbys will start one slot sync worker per logical
slot which might not be scalable. Is having one (or a few more - not
necessarily one for each logical slot) worker for all logical slots
enough?

It seems safe to have one worker for all logical slots. It's not a
problem even if the worker takes a while to get around to syncing a
given logical slot on a candidate standby, because the standby is
guaranteed to retain all the WAL and row versions required to decode
and send changes for the logical slots.

3. Indefinite waiting of logical walsenders for candidate standbys may
not be a good idea. Is having a timeout for logical walsenders a good
idea?

A problem with a timeout is that it can make logical slots unusable
after failover: once a walsender stops waiting, it can get ahead of a
candidate standby.

4. All candidate standbys retain the WAL required by logical slots. The
amount of WAL retained may be huge if logical replication subscribers
are lagging.

This turns out to be a typical problem with replication, so there's
nothing much this feature can do to prevent WAL file accumulation
except for asking one to monitor replication lag and WAL file growth.

5. The replication lag of logical subscribers will depend on the
replication lag of all candidate standbys. If the candidate standbys
are far behind the primary, logical subscribers will lag too, even if
they could otherwise keep up closely. There's nothing much this feature
can do to prevent this except for calling it out in the documentation.

6. This feature might need to prevent the relevant GUCs from diverging
between the primary and the candidate standbys. There's no point in
syncing a logical slot on candidate standbys if the logical walsender
related to it on the primary isn't keeping itself behind all the
candidate standbys. If preventing this proves to be tough, calling out
in the documentation that the GUCs must be kept the same is a good
start.

7. There are some important review comments upthread about this design
and the patches:
https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de
and
https://www.postgresql.org/message-id/20220207203222.22aktwxrt3fcllru%40alap3.anarazel.de
I'm sure we can come back to these once the design is clear.
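
For consideration 2 above, a rough sketch of what one pass of a single
worker over all logical slots could look like. This is an assumption
about the shape of such a worker, not what the v9 patches do: the
function name is hypothetical, and slots are modeled as dicts keyed by
slot name with integer restart_lsn and confirmed_flush_lsn fields.

```python
# Illustrative sketch only; the function name and the slot model are
# hypothetical and are not taken from the patch set.

def sync_slots_once(primary_slots, standby_slots):
    """One pass of a single worker over all logical slots.

    Both arguments map slot name -> dict with integer 'restart_lsn' and
    'confirmed_flush_lsn'. standby_slots is updated in place.
    Returns the names of slots that were created or advanced.
    """
    changed = []
    for name, remote in primary_slots.items():
        local = standby_slots.get(name)
        if local is None:
            # Slot does not exist on the standby yet: create a copy.
            standby_slots[name] = dict(remote)
            changed.append(name)
        elif remote["confirmed_flush_lsn"] > local["confirmed_flush_lsn"]:
            # Only ever move a synced slot forward, never backward.
            local.update(remote)
            changed.append(name)
    return changed
```

Calling this in a loop with a sleep between passes would give the
"periodic sync" behaviour; a slot that is reached late only delays the
sync and does not lose data, per the retention argument in point 2.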

Please feel free to add to the list if I'm missing anything.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v9-0001-Allow-logical-walsenders-to-wait-for-physical-sta.patch application/x-patch 21.1 KB
v9-0002-Add-logical-slot-sync-capability-to-physical-stan.patch application/x-patch 53.4 KB
