Re: [HACKERS] Restricting maximum keep segments by repslots

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-03 23:14:03
Message-ID: 20200403231403.GA14901@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

So, the more I look at this patch, the less I like the way the slots are
handled.

* I think it's a mistake to try to do anything in KeepLogSeg itself;
that function is merely in charge of some arithmetic. I propose to
make that function aware of the new size limitation (so that it
doesn't trust the slot's LSNs completely), but to make the function
have no side effects. The attached patch does that, I hope.
To replace that responsibility, let's add another function. I named it
InvalidateObsoleteReplicationSlots(). In CreateCheckPoint and
CreateRestartPoint, we call the new function just before removing
segments. Note: the one in this patch doesn't actually work or even
compile.
The new function must:

1. mark the slot as "invalid" somehow. Maybe it would make sense to
add a new flag in the on-disk struct for this; but for now I'm just
thinking that changing the slot's restart_lsn is sufficient.
(Of course, I haven't tested this, so there might be side-effects that
mean that this idea doesn't work).

2. send SIGTERM to a walsender that's using such a slot.

3. Send the warning message. Instead of trying to construct a message
with a list of slots, send one message per slot. (I think this is
better from a translatability point of view, and also from a
monitoring PoV).

* GetWalAvailability seems too much in competition with
DistanceToWalRemoval. Which is weird, because both functions do
pretty much the same thing. I think a better design is to make the
former function return the distance as an out parameter.

* Andres complained that the "distance" column was not a great value to
expose (20171106132050(dot)6apzynxrqrzghb4r(at)alap3(dot)anarazel(dot)de). That's
right: it changes both by the insertion LSN as well as the slot's
consumption. Maybe we can expose the earliest live LSN (start of the
earliest segment?) as a new column. It'll be the same for all slots,
I suppose, but we don't care, do we?

I attach a rough sketch, which as I said before doesn't work and doesn't
compile. Sadly I have reached the end of my day here so I won't be able
to work on this for today anymore. I'll be glad to try again tomorrow,
but in the meantime I thought it was better to send it over and see
whether you had any thoughts about this proposed design (maybe you know
it doesn't work for some reason), or better yet, you have the chance to
actually complete the code or at least move it a little further.

Thanks

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v23-0001-Add-WAL-relief-vent-for-replication-slots.patch text/x-diff 29.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-04-03 23:18:32 Re: vacuum_defer_cleanup_age inconsistently applied on replicas
Previous Message Tom Lane 2020-04-03 23:02:13 Re: backup manifests and contemporaneous buildfarm failures