Re: Review for GetWALAvailability()

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Review for GetWALAvailability()
Date: 2020-06-23 08:41:44
Message-ID: 20200623.174144.341250500168881179.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for looking this.

At Fri, 19 Jun 2020 18:23:59 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Jun-17, Kyotaro Horiguchi wrote:
>
> > @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
> > * the first WAL segment file since startup, which causes the status being
> > * wrong under certain abnormal conditions but that doesn't actually harm.
> > */
> > - oldestSeg = XLogGetLastRemovedSegno() + 1;
> > + oldestSeg = last_removed_seg + 1;
> >
> > /* calculate oldest segment by max_wal_size and wal_keep_segments */
> > XLByteToSeg(currpos, currSeg, wal_segment_size);
>
> This hunk should have updated the comment two lines above. However:
>
> > @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
> > rsinfo->setResult = tupstore;
> > rsinfo->setDesc = tupdesc;
> >
> > + /*
> > + * Remember the last removed segment at this point for the consistency in
> > + * this table. Since there's no interlock between slot data and
> > + * checkpointer, the segment can be removed in-between, but that doesn't
> > + * make any practical difference.
> > + */
> > + last_removed_seg = XLogGetLastRemovedSegno();
>
> I am mystified as to why you added this change. I understand that your
> point here is to make all slots reported their state as compared to the
> same LSN, but why do it like that? If a segment is removed in between,
> it could mean that the view reports more lies than it would if we update
> the segno for each slot. I mean, suppose two slots are lagging behind
> and one is reported as 'extended' because when we compute it it's still
> in range; then a segment is removed. With your coding, we'll report
> both as extended, but with the original coding, we'll report the new one
> as lost. By the time the user reads the result, they'll read one
> incorrect report with the original code, and two incorrect reports with
> your code. So ... yes it might be more consistent, but what does that
> buy the user?

I agree to you. Anyway the view may show "wrong" statuses if
concurrent WAL-file removal is running. But I can understand it is
better that the numbers in a view are consistent. The change
contributes only to that point. So I noted as "doesn't make any
practical difference". Since it is going to be removed, I removed the
changes for the part.

https://www.postgresql.org/message-id/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com

> OTOH it makes GetWALAvailability gain a new argument, which we have to
> document.
>
> > + /*
> > + * However segments required by the slot has been lost, if walsender is
> > + * active the walsender can read into the first reserved slot.
> > + */
> > + if (slot_is_active)
> > + return WALAVAIL_BEING_REMOVED;
>
> I don't understand this comment; can you please clarify what you mean?

I have had comments that the "lost" state should be a definite state,
that is, a state mustn't go back to other states. I had the same from
Fujii-san again.

Suppose we are starting from the following situation:

State A:
|---- seg n-1 ----|---- seg n ----|
^
X (restart_lsn of slot S) - max_slot_wal_keep_size

If the segment n-1 is removed, slot S's status becomes
"lost". However, if the walsender that is using the slot has not been
killed yet, the point X can move foward to the segment n (State B).

State B:
|XXXX seg n-1 XXXX|---- seg n ----|
^
X (restart_lsn of slot S) - max_slot_wal_keep_size

This is the normal (or extend) state. If we want to the state "lost"
to be definitive, we cannot apply the state label "lost" to State A if
it is active.

WALAVAIL_BEING_REMOVED (I noticed it has been removed for a wrong
reason so I revived it in this patch [1].) was used for the same state,
that is, the segment at restart_lsn will be removed soon but not yet.

1: https://www.postgresql.org/message-id/20200406.185027.648866525989475817.horikyota.ntt@gmail.com

> I admit I don't like this slot_is_active argument you propose to add to
> GetWALAvailability either; previously the function can be called with
> an LSN coming from anywhere, not just a slot; the new argument implies
> that the LSN comes from a slot. (Your proposed patch doesn't document
> this one either.)

Agreed. I felt like you at the time. I came up with another way after
hearing that from you.

In the attached GetWALAvailability() returns the state assuming the
walsender is not active. And the caller (pg_get_replication_slots())
considers the case where the walsender is active.

regares.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Some-fixes-of-pg_replication_slots.wal_status.patch text/x-patch 16.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Danzberger 2020-06-23 09:40:53 Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events
Previous Message Kyotaro Horiguchi 2020-06-23 08:08:14 Re: min_safe_lsn column in pg_replication_slots view