Re: [HACKERS] Restricting maximum keep segments by repslots

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-03-31 21:01:36
Message-ID: 20200331210136.GA6633@alvherre.pgsql
Lists: pgsql-hackers

I noticed some other things:

1. KeepLogSeg sends a warning message when slots fall behind. To do
this, it searches for "the most affected slot", that is, the slot that
lost the most data.  But it seems to me that that's a bit pointless; if
a slot lost data, it's now useless and anything that was using that slot
must be recreated.  If you only know which slot is the most affected,
it's not possible to see which *other* slots are affected.  It doesn't
matter whether the slot missed one segment, twenty segments or 9999
segments -- either the slot lost segments and is now useless, or it
didn't and it isn't.  I think we should instead list the slot that was
*least* affected, i.e., the slot that lost the smallest number of
segments; then the user knows that all slots that are older
than that one are *also* affected.
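
To make that concrete, finding it could look something like this (just a
sketch; "cutoffSegNo" stands for whatever segment number we computed as
the oldest one we're going to keep -- it's not a variable that exists in
the patch):

    XLogRecPtr  least_affected_lsn = InvalidXLogRecPtr;

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogSegNo   slotSegNo;

        if (!s->in_use || XLogRecPtrIsInvalid(s->data.restart_lsn))
            continue;

        XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size);

        /* slot lost segments; remember the newest such restart_lsn */
        if (slotSegNo < cutoffSegNo &&
            s->data.restart_lsn > least_affected_lsn)
            least_affected_lsn = s->data.restart_lsn;
    }

Then the warning only has to name that one slot, and any slot whose
restart_lsn is at or behind least_affected_lsn is known to be affected
too.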

2. KeepLogSeg ignores slots that are active. I guess the logic here is
that if a slot is active, then it'll keep going until it catches up and
we don't need to do anything about the used disk space. But that seems
a false premise, because if a standby is so slow that it cannot keep up,
it will eventually run the master out of disk space even if it's active
all the time. So I'm not seeing the reasoning that makes it useful to
skip checking active slots.
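
That is, I'd let the calculation look at every in-use slot's restart_lsn
regardless of active_pid, roughly like ReplicationSlotsComputeRequiredLSN
already does (again only a sketch):

    XLogRecPtr  min_restart_lsn = InvalidXLogRecPtr;

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        /* note: no test of active_pid here */
        if (!s->in_use || XLogRecPtrIsInvalid(s->data.restart_lsn))
            continue;

        if (XLogRecPtrIsInvalid(min_restart_lsn) ||
            s->data.restart_lsn < min_restart_lsn)
            min_restart_lsn = s->data.restart_lsn;
    }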

(BTW I don't think you need to keep that many static variables in that
function.  Just the slot name should be sufficient, I think ... or maybe
even the *pointer* to the slot that was last reported.)

I think if a slot is behind and it lost segments, we should kill the
walsender that's using it, and unreserve the segments. So maybe
something like

    List       *pids_to_kill = NIL;
    ListCell   *l;
    int         nslots_affected = 0;

    LWLockAcquire( ... );
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogSegNo   slotSegNo;

        if (s->in_use)
        {
            /* segment this slot still needs */
            XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size);

            if (s->active_pid)
                pids_to_kill = lappend_int(pids_to_kill, s->active_pid);

            nslots_affected++;
            ... ; /* other stuff */
        }
    }
    LWLockRelease( ... );

    /* release lock before syscalls */
    foreach(l, pids_to_kill)
    {
        kill(lfirst_int(l), SIGTERM);
    }
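
As for the "unreserve the segments" part, one possible shape (only a
sketch, I haven't tried it) would be to clear the affected slot's
restart_lsn under its spinlock inside that loop, and recompute the shared
minimum once the LWLock has been released:

    /* inside the loop, for a slot that lost segments */
    SpinLockAcquire(&s->mutex);
    s->data.restart_lsn = InvalidXLogRecPtr;
    SpinLockRelease(&s->mutex);

    /* ... and after LWLockRelease(), recompute the minimum required LSN */
    ReplicationSlotsComputeRequiredLSN();

The WAL that the dead slot was holding back should then become removable
at the next checkpoint.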

I sense some attempt to salvage slots that are reading a segment that is
"outdated" and removed, but for which the walsender has an open file
descriptor. (This appears to be the "losing" state.) This seems
dangerous; for example, the segment might be recycled and overwritten
with different data.  Trying to keep track of that seems
doomed. And even if the walsender can still read that data, it's only a
matter of time before the next segment is also removed. So keeping the
walsender alive is futile; it only delays the inevitable.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
