Re: [Patch] ALTER SYSTEM READ ONLY

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2021-01-28 12:16:28
Message-ID: CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 26, 2021 at 2:38 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> > It is not like that, let me explain. When a user backend requests to alter WAL
> > prohibit state by using ASRO/ASRW DDL with the previous patch or calling
> > pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
> > set to the transition state i.e. going-read-only or going-read-write if it is
> > not already. If another backend trying to request the same alteration to the
> > wal prohibit state then nothing going to be changed in shared memory but that
> > backend needs to wait until the transition to the final wal prohibited state
> > completes. If a backend tries to request for the opposite state than the
> > previous which is in progress then it will see an error as "system state
> > transition to read only/write is already in progress". At a time only one
> > transition state can be set.
>
> Hrm. Well, then that needs to be abundantly clear in the relevant comments.
>
> > Now, I am a little bit concerned about the current function name. How about
> > pg_set_wal_prohibit_state(bool) name or have two functions as
> > pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
> > other suggestions?
>
> How about pg_prohibit_wal(true|false)?
>

LGTM. Used this.

> > > It seems odd that ProcessWALProhibitStateChangeRequest() returns
> > > without doing anything if !AmCheckpointerProcess(), rather than having
> > > that be an Assert(). Why is it like that?
> >
> > Like AbsorbSyncRequests().
>
> Well, that can be called not from the checkpointer, according to the
> comments. Specifically from the postmaster, I guess. Again, comments
> please.
>

Done.

> > If you think that there will be panic when UpdateFullPageWrites() and/or
> > XLogReportParameters() tries to write WAL since the shared memory state for WAL
> > prohibited is set then it is not like that. For those functions, WAL write is
> > explicitly enabled by calling LocalSetXLogInsertAllowed().
> >
> > I was under the impression that there won't be any problem if we allow the
> > writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be
> > considered as an exception since it is fine that this WAL record is not streamed
> > to standby while graceful failover, I may be wrong though.
>
> I don't think that's OK. I mean, the purpose of the feature is to
> prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
> the principle of least surprise.
>

Yes, you are correct.

I am still on this. The things that worried me here are the wal records sequence
being written in the startup process -- UpdateFullPageWrites() generate record
just before the recovery check-point record and XLogReportParameters() just
after that but before any other backend could write any wal record. We might
also need to follow the same sequence while changing the system to read-write.

But in our case maintaining this sequence seems to be a little difficult. let me
explain, when a backend executes a function (ie. pg_prohibit_wal(false)) to
make the system read-write then that system state changes will be conveyed by
the Checkpointer process to all existing backends using global barrier and then
checkpoint might want to write those records. While checkpoint in progress, few
existing backends who might have absorbed barriers can write new records that
might come before aforesaid wal record sequence to be written. Also, we might
think that we could write these records before emitting the super barrier which
also might not solve the problem because a new backend could connect the server
just after the read-write system state change request was made but before
Checkpointer could pick that. Such a backend could write WAL before the
Checkpointer could, (see IsWALProhibited()).

Apart from this I also had a thought on the point recovery_end_command execution
that happens just after the recovery end checkpoint in the Startup process.
I think, first of all, why should we go and execute this command if we are
read-only? I don't think there will be any use to boot-up a read-only server
as standby, which itself is read-only to some extent. Also, pg_basebackup from
read-only is not allowed, a new standby cannot be set up. I think,
IMHO, we should simply error-out if tried to boot-up read-only server as
standby using standby.signal file, thoughts?

> > I am not clear on this part. In the attached version I am sending SIGUSR1
> > instead of SIGINT, which works for me.
>
> OK.
>
> > The attached version does not address all your comments, I'll continue my work
> > on that.
>
> Some thoughts on this version:
>
> +/* Extract last two bits */
> +#define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
> + ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
> +#define WALPROHIBIT_NEXT_STATE(stateGeneration) \
> + WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
>
> This is really confusing. First, the comment looks like it applies to
> both based on how it is positioned, but that's clearly not true.
> Second, the naming is really hard to understand. Third, there don't
> seem to be comments explaining the theory of what is going on here.
> Fourth, stateGeneration refers not to which generation of state we've
> got here but to the combination of the state and the generation.
> However, it's not clear that we ever really use the generation for
> anything.
>
> I think that the direction you went with this is somewhat different
> from what I had in mind. That may be OK, but let me just explain the
> difference. We both had in mind the idea that the low two bits of the
> state would represent the current state and the upper bits would
> represent the state generation. However, I wasn't necessarily
> imagining that the only supported operation was making the combined
> value go up by 1. For instance, I had thought that perhaps the effect
> of trying to go read-only when we're in the middle of going read-write
> would be to cancel the previous operation and start the new one. What
> you have instead is that it errors out. So in your model a change
> always has to finish before the next one can start, which in turn
> means that the sequence is completely linear. In my idea the
> state+generation might go from say 1 to 7, because trying to go
> read-write would cancel the previous attempt to go read-only and
> replace it with an attempt to go the other direction, and from 7 we
> might go to to 9 if somebody now tries to go read-only again before
> that finishes. In your model, there's never any sort of cancellation
> of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.
>

Yes, that made implementation quite simple. I was under the impression that we
might not have that much concurrency that so many backends might be trying to
change the system state so quickly.

> One disadvantage of the way you've got it from a user perspective is
> that if I'm writing a tool, I might get an error telling me that the
> state change I'm trying to make is already in progress, and then I
> have to retry. With the other design, I might attempt a state change
> and have it fail because the change can't be completed, but I won't
> ever fail because I attempt a state change and it can't be started
> because we're in the wrong starting state. So, with this design, as
> the tool author, I may not be able to just say, well, I tried to
> change the state and it didn't work, so report the error to the user.
> I think with the other approach that would be more viable. But I might
> be wrong here; it would be interesting to hear what other people
> think.
>

Thinking a little bit more, I agree that your approach is more viable as it can
cancel previously in-progress state.

For e.g. in a graceful failure future, the master might have detected that he
lost the connection to all standby and immediately calls the function to change
the system state to read-only. But, it regains the connection soon and wants to
back to read-write then it might need to wait until the previous state
completion. That might be the worst if the system is quite busy and/or any
backend which might have stuck or too busy and could not absorb the barrier.

If you want, I try to change the way you have thought, in the next version.

> I dislike the use of the term state_gen or StateGen to refer to the
> combination of a state and a generation. That seems unintuitive. I'm
> tempted to propose that we just call it a counter, and, assuming we
> stick with the design as you now have it, explain it with a comment
> like this in walprohibit.h:
>
> "There are four possible states. A brand new database cluster is
> always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
> make it read only, then we enter the state
> WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
> enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
> tries to make it read write, we will enter the state
> WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
> we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
> transitions are the only ones possible; for example, if we're
> currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
> read-write will produce an error, and a second attempt to go read-only
> will not cause a state change. Thus, we can represent the state as a
> shared-memory counter that whose value only ever changes by adding 1.
> The initial value at postmaster startup is either 0 or 2, depending on
> whether the control file specifies the the system is starting
> read-only or read-write."
>

Thanks, added the same.

> And then maybe change all the state_gen references to reference
> wal_prohibit_counter or, where a shorter name is appropriate, counter.
>

Done.

> I think this might be clearer if we used different data types for the
> state and the state/generation combination, with functions to convert
> between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
> etc. maybe do:
>
> typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;
>
> And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:
>
> static inline WALProhibitState
> GetWALProhibitState(uint32 wal_prohibit_counter)
> {
> return (WALProhibitState) (wal_prohibit_counter & 3);
> }
>

Done.

> I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
> honestly. It's just a macro to add 1 to an integer. And you don't even
> use it consistently. Like pg_alter_wal_prohibit_state() does this:
>
> + /* Server is already in requested state */
> + if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
> + PG_RETURN_VOID();
>
> But then later does this:
>
> + next_state_gen = cur_state_gen + 1;
>
> Which is exactly the same thing as what you computed above using
> WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
> sure how to structure this to make it as simple as possible, but I
> don't think this is it.
>
> Honestly this whole logic here seems correct but a bit hard to follow.
> Like, maybe:
>
> wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
> switch (GetWALProhibitState(wal_prohibit_counter))
> {
> case WALPROHIBIT_STATE_READ_WRITE:
> if (!walprohibit) return;
> increment = true;
> break;
> case WALPROHIBIT_STATE_GOING_READ_WRITE:
> if (walprohibit) ereport(ERROR, ...);
> break;
> ...
> }
>
> And then just:
>
> if (increment)
> wal_prohibit_counter =
> pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
> target_counter_value = wal_prohibit_counter + 1;
> // random stuff
> // eventually wait until the counter reaches >= target_counter_value
>
> This might not be exactly the right idea though. I'm just looking for
> a way to make it clearer, because I find it a bit hard to understand
> right now. Maybe you or someone else will have a better idea.
>

Yeah, this makes code much cleaner than before, did the same in the attached
version. Thanks again.

> + success =
> pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
> +
> &cur_state_gen, next_state_gen);
> + Assert(success);
>
> I am almost positive that this is not OK. I think on some platforms
> atomics just randomly fail some percentage of the time. You always
> need a retry loop. Anyway, what happens if two people enter this
> function at the same time and both read the same starting counter
> value before either does anything?
>
> + /* To be sure that any later reads of memory happen
> strictly after this. */
> + pg_memory_barrier();
>
> You don't need a memory barrier after use of an atomic. The atomic
> includes a barrier.

Understood, removed.

Regards,
Amul

Attachment Content-Type Size
v14-0001-WIP-Implement-wal-prohibit-state-using-global-ba.patch application/x-patch 43.3 KB
v14-0003-WIP-Documentation.patch application/x-patch 6.5 KB
v14-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch application/x-patch 69.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-01-28 12:28:14 Re: Determine parallel-safety of partition relations for Inserts
Previous Message vignesh C 2021-01-28 11:52:24 Re: Printing backtrace of postgres processes