Re: [Patch] ALTER SYSTEM READ ONLY

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amul Sul <sulamul(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2021-01-25 21:07:49
Message-ID: CA+TgmobApx_pKVLN6S2-Rcj9hLDQes7T45Db2PHv=TVudPWUqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> It is not like that, let me explain. When a user backend requests to alter WAL
> prohibit state by using ASRO/ASRW DDL with the previous patch or calling
> pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
> set to the transition state i.e. going-read-only or going-read-write if it is
> not already. If another backend trying to request the same alteration to the
> wal prohibit state then nothing going to be changed in shared memory but that
> backend needs to wait until the transition to the final wal prohibited state
> completes. If a backend tries to request for the opposite state than the
> previous which is in progress then it will see an error as "system state
> transition to read only/write is already in progress". At a time only one
> transition state can be set.

Hrm. Well, then that needs to be abundantly clear in the relevant comments.

> Now, I am a little bit concerned about the current function name. How about
> pg_set_wal_prohibit_state(bool) name or have two functions as
> pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
> other suggestions?

How about pg_prohibit_wal(true|false)?

> > It seems odd that ProcessWALProhibitStateChangeRequest() returns
> > without doing anything if !AmCheckpointerProcess(), rather than having
> > that be an Assert(). Why is it like that?
>
> Like AbsorbSyncRequests().

Well, that can be called not from the checkpointer, according to the
comments. Specifically from the postmaster, I guess. Again, comments
please.

> If you think that there will be panic when UpdateFullPageWrites() and/or
> XLogReportParameters() tries to write WAL since the shared memory state for WAL
> prohibited is set then it is not like that. For those functions, WAL write is
> explicitly enabled by calling LocalSetXLogInsertAllowed().
>
> I was under the impression that there won't be any problem if we allow the
> writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be
> considered as an exception since it is fine that this WAL record is not streamed
> to standby while graceful failover, I may be wrong though.

I don't think that's OK. I mean, the purpose of the feature is to
prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
the principle of least surprise.

> I am not clear on this part. In the attached version I am sending SIGUSR1
> instead of SIGINT, which works for me.

OK.

> The attached version does not address all your comments, I'll continue my work
> on that.

Some thoughts on this version:

+/* Extract last two bits */
+#define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
+ ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define WALPROHIBIT_NEXT_STATE(stateGeneration) \
+ WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

This is really confusing. First, the comment looks like it applies to
both based on how it is positioned, but that's clearly not true.
Second, the naming is really hard to understand. Third, there don't
seem to be comments explaining the theory of what is going on here.
Fourth, stateGeneration refers not to which generation of state we've
got here but to the combination of the state and the generation.
However, it's not clear that we ever really use the generation for
anything.

I think that the direction you went with this is somewhat different
from what I had in mind. That may be OK, but let me just explain the
difference. We both had in mind the idea that the low two bits of the
state would represent the current state and the upper bits would
represent the state generation. However, I wasn't necessarily
imagining that the only supported operation was making the combined
value go up by 1. For instance, I had thought that perhaps the effect
of trying to go read-only when we're in the middle of going read-write
would be to cancel the previous operation and start the new one. What
you have instead is that it errors out. So in your model a change
always has to finish before the next one can start, which in turn
means that the sequence is completely linear. In my idea the
state+generation might go from say 1 to 7, because trying to go
read-write would cancel the previous attempt to go read-only and
replace it with an attempt to go the other direction, and from 7 we
might go to to 9 if somebody now tries to go read-only again before
that finishes. In your model, there's never any sort of cancellation
of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.

One disadvantage of the way you've got it from a user perspective is
that if I'm writing a tool, I might get an error telling me that the
state change I'm trying to make is already in progress, and then I
have to retry. With the other design, I might attempt a state change
and have it fail because the change can't be completed, but I won't
ever fail because I attempt a state change and it can't be started
because we're in the wrong starting state. So, with this design, as
the tool author, I may not be able to just say, well, I tried to
change the state and it didn't work, so report the error to the user.
I think with the other approach that would be more viable. But I might
be wrong here; it would be interesting to hear what other people
think.

I dislike the use of the term state_gen or StateGen to refer to the
combination of a state and a generation. That seems unintuitive. I'm
tempted to propose that we just call it a counter, and, assuming we
stick with the design as you now have it, explain it with a comment
like this in walprohibit.h:

"There are four possible states. A brand new database cluster is
always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
make it read only, then we enter the state
WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
tries to make it read write, we will enter the state
WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
transitions are the only ones possible; for example, if we're
currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
read-write will produce an error, and a second attempt to go read-only
will not cause a state change. Thus, we can represent the state as a
shared-memory counter that whose value only ever changes by adding 1.
The initial value at postmaster startup is either 0 or 2, depending on
whether the control file specifies the the system is starting
read-only or read-write."

And then maybe change all the state_gen references to reference
wal_prohibit_counter or, where a shorter name is appropriate, counter.

I think this might be clearer if we used different data types for the
state and the state/generation combination, with functions to convert
between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
etc. maybe do:

typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;

And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:

static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
return (WALProhibitState) (wal_prohibit_counter & 3);
}

I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
honestly. It's just a macro to add 1 to an integer. And you don't even
use it consistently. Like pg_alter_wal_prohibit_state() does this:

+ /* Server is already in requested state */
+ if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+ PG_RETURN_VOID();

But then later does this:

+ next_state_gen = cur_state_gen + 1;

Which is exactly the same thing as what you computed above using
WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
sure how to structure this to make it as simple as possible, but I
don't think this is it.

Honestly this whole logic here seems correct but a bit hard to follow.
Like, maybe:

wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
switch (GetWALProhibitState(wal_prohibit_counter))
{
case WALPROHIBIT_STATE_READ_WRITE:
if (!walprohibit) return;
increment = true;
break;
case WALPROHIBIT_STATE_GOING_READ_WRITE:
if (walprohibit) ereport(ERROR, ...);
break;
...
}

And then just:

if (increment)
wal_prohibit_counter =
pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
target_counter_value = wal_prohibit_counter + 1;
// random stuff
// eventually wait until the counter reaches >= target_counter_value

This might not be exactly the right idea though. I'm just looking for
a way to make it clearer, because I find it a bit hard to understand
right now. Maybe you or someone else will have a better idea.

+ success =
pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+
&cur_state_gen, next_state_gen);
+ Assert(success);

I am almost positive that this is not OK. I think on some platforms
atomics just randomly fail some percentage of the time. You always
need a retry loop. Anyway, what happens if two people enter this
function at the same time and both read the same starting counter
value before either does anything?

+ /* To be sure that any later reads of memory happen
strictly after this. */
+ pg_memory_barrier();

You don't need a memory barrier after use of an atomic. The atomic
includes a barrier.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-01-25 23:12:01 Re: Key management with tests
Previous Message Andres Freund 2021-01-25 20:48:13 Re: Snapbuild woes followup