Re: [Patch] ALTER SYSTEM READ ONLY

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amul Sul <sulamul(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2020-09-09 21:03:42
Message-ID: 20200909210342.ybgg7fyy7wuk7bra@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...

On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> index 42050ab7195..0ac826d3c2f 100644
> --- a/src/backend/nodes/readfuncs.c
> +++ b/src/backend/nodes/readfuncs.c
> @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
> READ_DONE();
> }
>
> +/*
> + * _readAlterSystemWALProhibitState
> + */
> +static AlterSystemWALProhibitState *
> +_readAlterSystemWALProhibitState(void)
> +{
> + READ_LOCALS(AlterSystemWALProhibitState);
> +
> + READ_BOOL_FIELD(WALProhibited);
> +
> + READ_DONE();
> +}
> +

Why do we need readfuncs support for this?

> +
> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +static void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> + /* some code */
> + elog(INFO, "AlterSystemSetWALProhibitState() called");
> +}

As long as it's not implemented it seems better to return an ERROR.

> @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
> VariableSetStmt *setstmt; /* SET subcommand */
> } AlterSystemStmt;
>
> +/* ----------------------
> + * Alter System Read Statement
> + * ----------------------
> + */
> +typedef struct AlterSystemWALProhibitState
> +{
> + NodeTag type;
> + bool WALProhibited;
> +} AlterSystemWALProhibitState;
> +

All the nearby fields use under_score_style names.

> From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul(dot)sul(at)enterprisedb(dot)com>
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
>
> Implementation:
>
> 1. When a user tried to change server state to WAL-Prohibited using
> ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> raises request to checkpointer by marking current state to inprogress in
> shared memory. Checkpointer, noticing that the current state is has

"is has"

> WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> then acknowledges back to the backend who requested the state change once
> the transition has been completed. Final state will be updated in control
> file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?

> 2. When a backend receives the WAL-Prohibited barrier, at that moment if
> it is already in a transaction and the transaction already assigned XID,
> then the backend will be killed by throwing FATAL(XXX: need more discussion
> on this)

> 3. Otherwise, if that backend running transaction which yet to get XID
> assigned we don't need to do anything special

Somewhat garbled sentence...

> 4. A new transaction (from existing or new backend) starts as a read-only
> transaction.

Maybe "(in an existing or in a new backend)"?

> 5. Autovacuum launcher as well as checkpointer will don't do anything in
> WAL-Prohibited server state until someone wakes us up. E.g. a backend
> might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"

> 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> and xlog rotation. Starting up again will perform crash recovery(XXX:
> need some discussion on this as well) but the end of recovery checkpoint
> will be skipped and it will be performed when the system changed to
> WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.

> 8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;

> 9. Add system_is_read_only GUC show the system state -- will true when system
> is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay. ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.

> --- /dev/null
> +++ b/src/backend/access/transam/walprohibit.c
> @@ -0,0 +1,321 @@
> +/*-------------------------------------------------------------------------
> + *
> + * walprohibit.c
> + * PostgreSQL write-ahead log prohibit states
> + *
> + *
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> + *
> + * src/backend/access/transam/walprohibit.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/walprohibit.h"
> +#include "pgstat.h"
> +#include "port/atomics.h"
> +#include "postmaster/bgwriter.h"
> +#include "storage/condition_variable.h"
> +#include "storage/procsignal.h"
> +#include "storage/shmem.h"
> +
> +/*
> + * Shared-memory WAL prohibit state
> + */
> +typedef struct WALProhibitStateData
> +{
> + /* Indicates current WAL prohibit state */
> + pg_atomic_uint32 SharedWALProhibitState;
> +
> + /* Startup checkpoint pending */
> + bool checkpointPending;
> +
> + /* Signaled when requested WAL prohibit state changes */
> + ConditionVariable walprohibit_cv;

You're using three different naming styles for as many members.

> +/*
> + * ProcessBarrierWALProhibit()
> + *
> + * Handle WAL prohibit state change request.
> + */
> +bool
> +ProcessBarrierWALProhibit(void)
> +{
> + /*
> + * Kill off any transactions that have an XID *before* allowing the system
> + * to go WAL prohibit state.
> + */
> + if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))

Hm. I wonder if this check is good enough. If you look at
RecordTransactionCommit() we also WAL log in some cases where no xid was
assigned. This is particularly true of (auto-)vacuum, but also for HOT
pruning.

I think it'd be good to put the logic of this check into xlog.c and
mirror the logic in RecordTransactionCommit(). And add cross-referencing
comments to RecordTransactionCommit and the new function, reminding our
futures selves that both places need to be modified.

> + {
> + /* Should be here only for the WAL prohibit state. */
> + Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);

There are no races where an ASRO READ ONLY is quickly followed by ASRO
READ WRITE where this could be reached?

> +/*
> + * AlterSystemSetWALProhibitState()
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> + uint32 state;
> +
> + if (!superuser())
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> + errmsg("must be superuser to execute ALTER SYSTEM command")));

See comments about this above.

> + /* Alter WAL prohibit state not allowed during recovery */
> + PreventCommandDuringRecovery("ALTER SYSTEM");
> +
> + /* Requested state */
> + state = stmt->WALProhibited ?
> + WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
> +
> + /*
> + * Since we yet to convey this WAL prohibit state to all backend mark it
> + * in-progress.
> + */
> + state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
> +
> + if (!SetWALProhibitState(state))
> + return; /* server is already in the desired state */
> +

This use of bitmasks seems unnecessary to me. I'd rather have one param
for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
for WALPROHIBIT_TRANSITION_IN_PROGRESS

> +/*
> + * RequestWALProhibitChange()
> + *
> + * Request checkpointer to make the WALProhibitState to read-only.
> + */
> +static void
> +RequestWALProhibitChange(void)
> +{
> + /* Must not be called from checkpointer */
> + Assert(!AmCheckpointerProcess());
> + Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> + /*
> + * If in a standalone backend, just do it ourselves.
> + */
> + if (!IsPostmasterEnvironment)
> + {
> + CompleteWALProhibitChange(GetWALProhibitState());
> + return;
> + }
> +
> + send_signal_to_checkpointer(SIGINT);
> +
> + /* Wait for the state to change to read-only */
> + ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
> + for (;;)
> + {
> + /* We'll be done once in-progress flag bit is cleared */
> + if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> + break;
> +
> + ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
> + WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
> + }
> + ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively a "ASRO transitions completed counter" or such might
be a better idea?

> +/*
> + * CompleteWALProhibitChange()
> + *
> + * Checkpointer will call this to complete the requested WAL prohibit state
> + * transition.
> + */
> +void
> +CompleteWALProhibitChange(uint32 wal_state)
> +{
> + uint64 barrierGeneration;
> +
> + /*
> + * Must be called from checkpointer. Otherwise, it must be single-user
> + * backend.
> + */
> + Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
> + Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> + /*
> + * WAL prohibit state change is initiated. We need to complete the state
> + * transition by setting requested WAL prohibit state in all backends.
> + */
> + elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
> +
> + /* Emit global barrier */
> + barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
> + WaitForProcSignalBarrier(barrierGeneration);
> +
> + /* And flush all writes. */
> + XLogFlush(GetXLogWriteRecPtr());

Hm, maybe I'm missing something, but why is the write pointer the right
thing to flush? That won't include records that haven't been written to
disk yet... We also need to trigger writing out all WAL that is as of
yet unwritten, no? Without having thought a lot about it, it seems that
GetXLogInsertRecPtr() would be the right thing to flush?

> + /* Set final state by clearing in-progress flag bit */
> + if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> + {
> + bool wal_prohibited;
> +
> + wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
> +
> + /* Update the control file to make state persistent */
> + SetControlFileWALProhibitFlag(wal_prohibited);

Hm. Is there an issue with not WAL logging the control file change? Is
there a scenario where we a crash + recovery would end up overwriting
this?

> + if (wal_prohibited)
> + ereport(LOG, (errmsg("system is now read only")));
> + else
> + {
> + /*
> + * Request checkpoint if the end-of-recovery checkpoint has been
> + * skipped previously.
> + */
> + if (WALProhibitState->checkpointPending)
> + {
> + RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
> + CHECKPOINT_IMMEDIATE);
> + WALProhibitState->checkpointPending = false;
> + }
> + ereport(LOG, (errmsg("system is now read write")));
> + }
> + }
> +
> + /* Wake up the backend who requested the state change */
> + ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);

Could be multiple backends, right?

> +}
> +
> +/*
> + * GetWALProhibitState()
> + *
> + * Atomically return the current server WAL prohibited state
> + */
> +uint32
> +GetWALProhibitState(void)
> +{
> + return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
> +}

Is there an issue with needing memory barriers here?

> +/*
> + * SetWALProhibitState()
> + *
> + * Change current WAL prohibit state to the input state.
> + *
> + * If the server is already completely moved to the requested WAL prohibit
> + * state, or if the desired state is same as the current state, return false,
> + * indicating that the server state did not change. Else return true.
> + */
> +bool
> +SetWALProhibitState(uint32 new_state)
> +{
> + bool state_updated = false;
> + uint32 cur_state;
> +
> + cur_state = GetWALProhibitState();
> +
> + /* Server is already in requested state */
> + if (new_state == cur_state ||
> + new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
> + return false;
> +
> + /* Prevent concurrent contrary in progress transition state setting */
> + if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
> + (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> + {
> + if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("system state transition to read only is already in progress"),
> + errhint("Try after sometime again.")));
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("system state transition to read write is already in progress"),
> + errhint("Try after sometime again.")));
> + }
> +
> + /* Update new state in share memory */
> + state_updated =
> + pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
> + &cur_state, new_state);
> +
> + if (!state_updated)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("system read write state concurrently changed"),
> + errhint("Try after sometime again.")));
> +

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where than can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.

> +/
> + * MarkCheckPointSkippedInWalProhibitState()
> + *
> + * Sets checkpoint pending flag so that it can be performed next time while
> + * changing system state to WAL permitted.
> + */
> +void
> +MarkCheckPointSkippedInWalProhibitState(void)
> +{
> + WALProhibitState->checkpointPending = true;
> +}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).
There's state in ControlFile *and* here for essentially the same thing.

> + * If it is not currently possible to insert write-ahead log records,
> + * either because we are still in recovery or because ALTER SYSTEM READ
> + * ONLY has been executed, force this to be a read-only transaction.
> + * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> + * us from modifying data during recovery when !XLogInsertAllowed(), but
> + * this gives the normal indication to the user that the transaction is
> + * read-only.
> + *
> + * On the other hand, we only need to set the startedInRecovery flag when
> + * the transaction started during recovery, and not when WAL is otherwise
> + * prohibited. This information is used by RelationGetIndexScan() to
> + * decide whether to permit (1) relying on existing killed-tuple markings
> + * and (2) further killing of index tuples. Even when WAL is prohibited
> + * on the master, it's still the master, so the former is OK; and since
> + * killing index tuples doesn't generate WAL, the latter is also OK.
> + * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
> + */
> + XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
> + s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...

> /*
> * Subroutine to try to fetch and validate a prior checkpoint record.
> *
> @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
> */
> WalSndWaitStopping();
>
> + /*
> + * The restartpoint, checkpoint, or xlog rotation will be performed if the
> + * WAL writing is permitted.
> + */
> if (RecoveryInProgress())
> CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> - else
> + else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we atually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.

> ShutdownCLOG();
> ShutdownCommitTs();
> ShutdownSUBTRANS();
> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 1b8cd7bacd4..aa4cdd57ec1 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
>
> HandleAutoVacLauncherInterrupts();
>
> + /* If the server is read only just go back to sleep. */
> + if (!XLogInsertAllowed())
> + continue;
> +

I think we really should have a different functions for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.

> @@ -342,6 +344,28 @@ CheckpointerMain(void)
> AbsorbSyncRequests();
> HandleCheckpointerInterrupts();
>
> + wal_state = GetWALProhibitState();
> +
> + if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
> + {
> + /* Complete WAL prohibit state change request */
> + CompleteWALProhibitChange(wal_state);
> + continue;
> + }
> + else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
> + {
> + /*
> + * Don't do anything until someone wakes us up. For example a
> + * backend might later on request us to put the system back to
> + * read-write wal prohibit sate.
> + */
> + (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> + WAIT_EVENT_CHECKPOINTER_MAIN);
> + continue;
> + }
> + Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
> +
> /*
> * Detect a pending checkpoint request by checking whether the flags
> * word in shared memory is nonzero. We shouldn't need to acquire the
> @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
>
> return FirstCall;
> }

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

> +/*
> + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
> + */
> +void
> +send_signal_to_checkpointer(int signum)
> +{
> + if (CheckpointerShmem->checkpointer_pid == 0)
> + elog(ERROR, "checkpointer is not running");
> +
> + if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
> + elog(ERROR, "could not signal checkpointer: %m");
> +}

Sudden switch to a different naming style...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-09-09 21:04:35 Re: [HACKERS] [PATCH] Generic type subscripting
Previous Message Jonathan S. Katz 2020-09-09 20:57:11 Re: PG 13 release notes, first draft