Re: Reviving lost replication slots

From: sirisha chamarthi <sirichamarthi22(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Reviving lost replication slots
Date: 2022-11-09 03:39:58
Message-ID: CAKrAKeUMHbM7gUDNa1zwugFak-FpqnF+ZwV29e8HRvytXPkWVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 7, 2022 at 11:17 PM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi
> <sirichamarthi22(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >>
> >> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
> >> <sirichamarthi22(at)gmail(dot)com> wrote:
> >> >
> >> > A replication slot can be lost when a subscriber is not able to catch
> up with the load on the primary and the WAL to catch up exceeds
> max_slot_wal_keep_size. When this happens, target has to be reseeded
> (pg_dump) from the scratch and this can take longer. I am investigating the
> options to revive a lost slot.
> >> >
> >>
> >> Why in the first place one has to set max_slot_wal_keep_size if they
> >> care for WAL more than that?
> >
> > Disk full is a typical use where we can't wait until the logical slots
> to catch up before truncating the log.
>
> If the max_slot_wal_keep_size is set appropriately and the replication
> lag is monitored properly along with some automatic actions such as
> replacing/rebuilding the standbys or subscribers (which may not be
> easy and cheap though), the chances of hitting the "lost replication"
> problem becomes less, but not zero always.
>

pg_dump and pg_restore can take several hours to days on a large database.
Keeping the WAL in the pg_wal folder (faster, smaller and costly disks?) is
not always an option.

>
> >> If you have a case where you want to
> >> handle this case for some particular slot (where you are okay with the
> >> invalidation of other slots exceeding max_slot_wal_keep_size) then the
> >> other possibility could be to have a similar variable at the slot
> >> level but not sure if that is a good idea because you haven't
> >> presented any such case.
> >
> > IIUC, ability to fetch WAL from the archive as a fall back mechanism
> should automatically take care of all the lost slots. Do you see a need to
> take care of a specific slot? If the idea is not to download the wal files
> in the pg_wal directory, they can be placed in a slot specific folder
> (data/pg_replslot/<slot>/) until they are needed while decoding and can be
> removed.
>
> Is the idea here the core copying back the WAL files from the archive?
> If yes, I think it is not something the core needs to do. This very
> well fits the job of an extension or an external module that revives
> the lost replication slots by copying WAL files from archive location.
>

The current code is throwing an error that the slot is lost because the
restart_lsn is set to invalid LSN when the WAL is truncated by
checkpointer. In order to build an external service that can revive a lost
slot, at the minimum we needed the patch attached.

>
> Having said above, what's the best way to revive a lost replication
> slot today? Any automated way exists today? It seems like
> pg_replication_slot_advance() doesn't do anything for the
> invalidated/lost slots.
>

If the WAL is available in the pg_wal directory, the replication stream
resumes normally when the client connects with the patch I posted.

>
> If it's a streaming replication slot, the standby will anyway jump to
> archive mode ignoring the replication slot and the slot will never be
> usable again unless somebody creates a new replication slot and
> provides it to the standby for reuse.
> If it's a logical replication slot, the subscriber will start to
> diverge from the publisher and the slot will have to be revived
> manually i.e. created again.
>

Physical slots can be revived with standby downloading the WAL from the
archive directly. This patch is helpful for the logical slots.

>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-11-09 04:09:01 Re: Allow file inclusion in pg_hba and pg_ident files
Previous Message sirisha chamarthi 2022-11-09 03:26:45 Re: Reviving lost replication slots