Re: Replication slot stats misgivings

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Replication slot stats misgivings
Date: 2021-04-10 12:53:04
Message-ID: CAA4eK1Jt40-gHsFkinREKMKU+0C4RPgzYu3SwFezcZFc9srnfA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Apr 10, 2021 at 7:24 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> IIUC there are two problems in the case where the drop message is lost:
>
> 1. Writing beyond the end of replSlotStats.
> This can happen if after restarting the number of slots whose stats
> are stored in the stats file exceeds max_replication_slots. Vignesh's
> patch addresses this problem.
>
> 2. The stats for the new slot are not recorded.
> If the stats for already-dropped slots remain in replSlotStats, the
> stats for the new slot cannot be registered due to the full of
> replSlotStats. This can happen even when after restarting the number
> of slots whose stats are stored in the stat file does NOT exceed
> max_replication_slots as well as even during the server running. The
> patch doesn’t address this problem. (If this happens, we will have to
> reset all slot stats since pg_stat_reset_replication_slot() cannot
> remove the slot stats with the non-existing name).
>
> I think we can use HTAB to store slot stats and have
> pg_stat_get_replication_slot() inquire about stats by the slot name,
> resolving both problems. By using HTAB we're no longer concerned about
> the problem of writing stats beyond the end of the replSlotStats
> array. Instead, we have to consider how and when to clean up the stats
> for already-dropped slots. We can have the startup process send slot
> names at startup time, which borrows the idea proposed by Amit. But
> maybe we need to consider the case again where the message from the
> startup process is lost? Another idea would be to have
> pgstat_vacuum_stat() check the existing slots and call
> pgstat_report_replslot_drop() if the slot in the stats file doesn't
> exist. That way, we can continuously check the stats for
> already-dropped slots.
>

Agreed, I think checking periodically via pgstat_vacuum_stat is a
better idea then sending once at start up time. I also think using
slot_name is better than using 'idx' (index in
ReplicationSlotCtl->replication_slots) in this scheme because even
after startup 'idx' changes we will be able to drop the dead slot.

> I've written a PoC patch for the above idea; using HTAB and cleaning
> up slot stats at pgstat_vacuum_stat(). The patch can be applied on top
> of 0001 patch Vignesh proposed before[1].
>

It seems Vignesh has changed patches based on the latest set of
comments so you might want to rebase.

> Please note that this cannot resolve the problem of ending up
> accumulating the stats to the old slot if the slot is re-created with
> the same name and the drop message is lost. To deal with this problem
> I think we would need to use something unique identifier for each slot
> instead of slot name.
>

Right, we can probably write it in comments and or docs about this
caveat and the user can probably use pg_stat_reset_replication_slot
for such slots.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2021-04-10 12:53:28 Re: truncating timestamps on arbitrary intervals
Previous Message Peter Eisentraut 2021-04-10 11:42:57 Re: truncating timestamps on arbitrary intervals