From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | alvherre(at)2ndquadrant(dot)com |
Cc: | masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Review for GetWALAvailability() |
Date: | 2020-06-23 08:41:44 |
Message-ID: | 20200623.174144.341250500168881179.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Thanks for looking this.
At Fri, 19 Jun 2020 18:23:59 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Jun-17, Kyotaro Horiguchi wrote:
>
> > @@ -9524,7 +9533,7 @@ GetWALAvailability(XLogRecPtr targetLSN)
> > * the first WAL segment file since startup, which causes the status being
> > * wrong under certain abnormal conditions but that doesn't actually harm.
> > */
> > - oldestSeg = XLogGetLastRemovedSegno() + 1;
> > + oldestSeg = last_removed_seg + 1;
> >
> > /* calculate oldest segment by max_wal_size and wal_keep_segments */
> > XLByteToSeg(currpos, currSeg, wal_segment_size);
>
> This hunk should have updated the comment two lines above. However:
>
> > @@ -272,6 +273,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
> > rsinfo->setResult = tupstore;
> > rsinfo->setDesc = tupdesc;
> >
> > + /*
> > + * Remember the last removed segment at this point for the consistency in
> > + * this table. Since there's no interlock between slot data and
> > + * checkpointer, the segment can be removed in-between, but that doesn't
> > + * make any practical difference.
> > + */
> > + last_removed_seg = XLogGetLastRemovedSegno();
>
> I am mystified as to why you added this change. I understand that your
> point here is to make all slots reported their state as compared to the
> same LSN, but why do it like that? If a segment is removed in between,
> it could mean that the view reports more lies than it would if we update
> the segno for each slot. I mean, suppose two slots are lagging behind
> and one is reported as 'extended' because when we compute it it's still
> in range; then a segment is removed. With your coding, we'll report
> both as extended, but with the original coding, we'll report the new one
> as lost. By the time the user reads the result, they'll read one
> incorrect report with the original code, and two incorrect reports with
> your code. So ... yes it might be more consistent, but what does that
> buy the user?
I agree to you. Anyway the view may show "wrong" statuses if
concurrent WAL-file removal is running. But I can understand it is
better that the numbers in a view are consistent. The change
contributes only to that point. So I noted as "doesn't make any
practical difference". Since it is going to be removed, I removed the
changes for the part.
https://www.postgresql.org/message-id/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
> OTOH it makes GetWALAvailability gain a new argument, which we have to
> document.
>
> > + /*
> > + * However segments required by the slot has been lost, if walsender is
> > + * active the walsender can read into the first reserved slot.
> > + */
> > + if (slot_is_active)
> > + return WALAVAIL_BEING_REMOVED;
>
> I don't understand this comment; can you please clarify what you mean?
I have had comments that the "lost" state should be a definite state,
that is, a state mustn't go back to other states. I had the same from
Fujii-san again.
Suppose we are starting from the following situation:
State A:
|---- seg n-1 ----|---- seg n ----|
^
X (restart_lsn of slot S) - max_slot_wal_keep_size
If the segment n-1 is removed, slot S's status becomes
"lost". However, if the walsender that is using the slot has not been
killed yet, the point X can move foward to the segment n (State B).
State B:
|XXXX seg n-1 XXXX|---- seg n ----|
^
X (restart_lsn of slot S) - max_slot_wal_keep_size
This is the normal (or extend) state. If we want to the state "lost"
to be definitive, we cannot apply the state label "lost" to State A if
it is active.
WALAVAIL_BEING_REMOVED (I noticed it has been removed for a wrong
reason so I revived it in this patch [1].) was used for the same state,
that is, the segment at restart_lsn will be removed soon but not yet.
1: https://www.postgresql.org/message-id/20200406.185027.648866525989475817.horikyota.ntt@gmail.com
> I admit I don't like this slot_is_active argument you propose to add to
> GetWALAvailability either; previously the function can be called with
> an LSN coming from anywhere, not just a slot; the new argument implies
> that the LSN comes from a slot. (Your proposed patch doesn't document
> this one either.)
Agreed. I felt like you at the time. I came up with another way after
hearing that from you.
In the attached GetWALAvailability() returns the state assuming the
walsender is not active. And the caller (pg_get_replication_slots())
considers the case where the walsender is active.
regares.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment | Content-Type | Size |
---|---|---|
0001-Some-fixes-of-pg_replication_slots.wal_status.patch | text/x-patch | 16.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Danzberger | 2020-06-23 09:40:53 | Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events |
Previous Message | Kyotaro Horiguchi | 2020-06-23 08:08:14 | Re: min_safe_lsn column in pg_replication_slots view |