Re: Synchronizing slots from primary to standby

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-26 07:29:07
Message-ID: Zdw9w4jZzzN1KEV6@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, Feb 26, 2024 at 02:18:58AM +0000, Zhijie Hou (Fujitsu) wrote:
> On Friday, February 23, 2024 6:12 PM Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > + if (!ok)
> > + GUC_check_errdetail("List syntax is invalid.");
> > +
> > + /*
> > + * If there is a syntax error in the name or if the replication slots'
> > + * data is not initialized yet (i.e., we are in the startup process), skip
> > + * the slot verification.
> > + */
> > + if (!ok || !ReplicationSlotCtl)
> > + {
> > + pfree(rawname);
> > + list_free(elemlist);
> > + return ok;
> > + }
> >
> > we are testing the "ok" value twice, what about using if...else if... instead and
> > test it once? If so, it might be worth to put the:
> >
> > "
> > + pfree(rawname);
> > + list_free(elemlist);
> > + return ok;
> > "
> >
> > in a "goto".
>
> There were comments to remove the 'goto' statement and avoid
> duplicate free code, so I prefer the current style.

The duplicate free code would come from the if...else if... rewrite but then
the "goto" would remove it, so I'm not sure to understand your point.

> >
> > 7 ===
> >
> > + * Return a copy of standby_slot_names_list if the copy flag is set to
> > + true,
> >
> > Not sure, but would it be worth explaining why one would want to set to flag to
> > true or false? (i.e why one would not want to receive the original list).
>
> I think the usage can be found from the caller's code, e.g we need to remove
> the slots that caught up from the list each time, so we cannot directly modify
> the global list. The GetStandbySlotList function is general function and I feel
> we can avoid adding more comments here.

Okay, yeah makes sense.

> >
> > 10 ===
> >
> > + if (slot->data.invalidated != RS_INVAL_NONE)
> > + {
> > + /*
> > + * Specified physical slot have been invalidated,
> > so no point
> > + * in waiting for it.
> >
> > We discovered in [1], that if the wal_status is "unreserved" then the slot is still
> > serving the standby. I think we should handle this case differently, thoughts?
>
> I think the 'invalidated' slot can still be used is a separate bug.
> Because
> once the slot is invalidated, it can neither protect WALs or ROWs from being
> removed even if the restart_lsn of the slot can be moved forward after being invalidated.
>
> If the standby can move restart_lsn forward for invalidated slots, then
> it should also set the 'invalidated' flag back to NONE, otherwise the slot
> cannot serve its purpose anymore. I also reported similar bug before[1].

I see. But should'nt we add a check on restart_lsn as this is done here in
pg_get_replication_slots()?

"
case WALAVAIL_REMOVED:

/*
* If we read the restart_lsn long enough ago, maybe that file
* has been removed by now. However, the walsender could have
* moved forward enough that it jumped to another file after
* we looked. If checkpointer signalled the process to
* termination, then it's definitely lost; but if a process is
* still alive, then "unreserved" seems more appropriate.

if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))

"

My point is that I think we should behave like it's not a bug and then adapt the
code accordingly here (until the bug gets fixed).

Currently we are not waiting for this slot while it's still serving the standby
which does not seem good too, thoughts?

> Attach the V98 patch set which addressed above comments.
> I also adjusted few comments based on off-list comments from Shveta.

Thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2024-02-26 07:29:44 Re: ExecAppendAsyncEventWait() in REL_14_STABLE can corrupt PG_exception_stack
Previous Message Jelte Fennema-Nio 2024-02-26 07:20:05 Re: [PATCH] Add --syntax to postgres for SQL syntax checking