Re: Reviving lost replication slots

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: sirisha chamarthi <sirichamarthi22(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Reviving lost replication slots
Date: 2022-11-08 07:17:23
Message-ID: CALj2ACWeT-HxX0xtHY-5e+aEBXGA-7s6hkstr1-8V55v5YK-2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
<sirichamarthi22(at)gmail(dot)com> wrote:
>
> On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>> <sirichamarthi22(at)gmail(dot)com> wrote:
>> >
>> > A replication slot can be lost when a subscriber is not able to catch up with the load on the primary and the WAL to catch up exceeds max_slot_wal_keep_size. When this happens, target has to be reseeded (pg_dump) from the scratch and this can take longer. I am investigating the options to revive a lost slot.
>> >
>>
>> Why in the first place one has to set max_slot_wal_keep_size if they
>> care for WAL more than that?
>
> Disk full is a typical use where we can't wait until the logical slots to catch up before truncating the log.

If the max_slot_wal_keep_size is set appropriately and the replication
lag is monitored properly along with some automatic actions such as
replacing/rebuilding the standbys or subscribers (which may not be
easy and cheap though), the chances of hitting the "lost replication"
problem becomes less, but not zero always.

>> If you have a case where you want to
>> handle this case for some particular slot (where you are okay with the
>> invalidation of other slots exceeding max_slot_wal_keep_size) then the
>> other possibility could be to have a similar variable at the slot
>> level but not sure if that is a good idea because you haven't
>> presented any such case.
>
> IIUC, ability to fetch WAL from the archive as a fall back mechanism should automatically take care of all the lost slots. Do you see a need to take care of a specific slot? If the idea is not to download the wal files in the pg_wal directory, they can be placed in a slot specific folder (data/pg_replslot/<slot>/) until they are needed while decoding and can be removed.

Is the idea here the core copying back the WAL files from the archive?
If yes, I think it is not something the core needs to do. This very
well fits the job of an extension or an external module that revives
the lost replication slots by copying WAL files from archive location.

Having said above, what's the best way to revive a lost replication
slot today? Any automated way exists today? It seems like
pg_replication_slot_advance() doesn't do anything for the
invalidated/lost slots.

If it's a streaming replication slot, the standby will anyway jump to
archive mode ignoring the replication slot and the slot will never be
usable again unless somebody creates a new replication slot and
provides it to the standby for reuse.
If it's a logical replication slot, the subscriber will start to
diverge from the publisher and the slot will have to be revived
manually i.e. created again.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-11-08 07:46:26 Re: Suppressing useless wakeups in walreceiver
Previous Message rajesh singarapu 2022-11-08 06:58:30 Re: Use proc instead of MyProc in ProcArrayGroupClearXid()/TransactionGroupUpdateXidStatus()