Re: [Patch] ALTER SYSTEM READ ONLY

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2020-06-17 13:02:12
Message-ID: CAA4eK1+5BDNS08XKXR7UPkq4tDaV66wyZia4kDMvRUm=a_A=Gg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul(at)gmail(dot)com> wrote:
>
> Hi,
>
> The attached patch proposes the $Subject feature, which forces the system into
> read-only mode, where inserting write-ahead log (WAL) records is prohibited until
> ALTER SYSTEM READ WRITE is executed.
>
> The high-level goal is to make the availability/scale-out situation better. The feature
> will help HA setups where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting to write WAL at its end, for example
> when the network goes down on the master or replication connections fail.
>
> For example, this feature allows for a controlled switchover without needing to shut
> down the master. You can instead make the master read-only, wait until the standby
> catches up, and then promote the standby. The master remains available for read
> queries throughout, and also for WAL streaming, but without the possibility of any
> new write transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind. (Eventually, it
> would be nice to be able to make the read-only master into a standby without having
> to restart it, but that is a problem for another patch.)
>
> This might also help in failover scenarios. For example, if you detect that the master
> has lost network connectivity to the standby, you might make it read-only after 30 s,
> and promote the standby after 60 s, so that you never have two writable masters at
> the same time. In this case, there's still some split-brain, but it's still better than what
> we have now.
>
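
To check my reading of the 30 s / 60 s example, here is a minimal sketch of the
timing rule. It is only an illustration: the function names and constants are
hypothetical and not part of the patch, and it assumes both nodes notice the
lost connection at the same instant, which a real deployment cannot guarantee
(hence the remaining split-brain window mentioned above).

```c
#include <stdbool.h>

/* Hypothetical thresholds, taken from the 30 s / 60 s example in the text. */
#define READ_ONLY_AFTER_SECS 30
#define PROMOTE_AFTER_SECS   60

/* True while the old master may still accept writes. */
bool master_is_writable(int secs_since_link_lost)
{
    return secs_since_link_lost < READ_ONLY_AFTER_SECS;
}

/* True once the standby has been promoted and accepts writes. */
bool standby_is_promoted(int secs_since_link_lost)
{
    return secs_since_link_lost >= PROMOTE_AFTER_SECS;
}

/* The situation the scheme is meant to rule out. */
bool two_writable_masters(int secs_since_link_lost)
{
    return master_is_writable(secs_since_link_lost) &&
           standby_is_promoted(secs_since_link_lost);
}
```

As long as the read-only threshold is smaller than the promotion threshold,
two_writable_masters() is false at every point on this idealized timeline.
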
> Design:
> ----------
> The proposed feature is built atop the super barrier mechanism commit[1] to coordinate
> global state changes to all active backends. A backend that executes the
> ALTER SYSTEM READ { ONLY | WRITE } command places a request with the checkpointer
> process to change the WAL read/write state, aka the WAL prohibited and WAL
> permitted states respectively. When the checkpointer process sees the WAL prohibit
> state change request, it emits a global barrier and waits until all backends that
> participate in ProcSignal absorb it. Once that is done, the WAL read/write state in
> shared memory and the control file is updated so that XLogInsertAllowed() returns
> accordingly.
>

Do we also prohibit the checkpointer from writing dirty pages and a
checkpoint record? If so, will the checkpointer process write out the
current dirty pages and a checkpoint record, or do we skip that as
well?
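
For reference, this is how I read the sequence described above, written as a
toy single-process model. It is a sketch under assumptions, not the patch's
code: the Cluster struct and all function names are hypothetical, the barrier
"wait" is simulated by driving each backend in a loop, and the real
XLogInsertAllowed() considers more than this one flag (recovery, for
instance).

```c
#include <stdbool.h>

#define NUM_BACKENDS 3                  /* hypothetical, sketch only */

typedef enum { WAL_PERMITTED, WAL_PROHIBITED } WalState;

typedef struct
{
    WalState shared_state;              /* stand-in for shared memory */
    WalState control_file_state;        /* stand-in for the control file */
    bool     has_xid[NUM_BACKENDS];     /* open transaction with an XID */
    bool     alive[NUM_BACKENDS];       /* session still connected */
    bool     absorbed[NUM_BACKENDS];    /* absorbed the current barrier */
} Cluster;

void cluster_init(Cluster *c)
{
    c->shared_state = WAL_PERMITTED;
    c->control_file_state = WAL_PERMITTED;
    for (int i = 0; i < NUM_BACKENDS; i++)
    {
        c->has_xid[i] = false;
        c->alive[i] = true;
        c->absorbed[i] = false;
    }
}

/* Backend side: a session holding an XID is killed before absorbing. */
void backend_absorb_barrier(Cluster *c, int backend, WalState requested)
{
    if (requested == WAL_PROHIBITED && c->has_xid[backend])
    {
        c->alive[backend] = false;      /* FATAL in the proposal */
        c->has_xid[backend] = false;
    }
    c->absorbed[backend] = true;
}

/* Checkpointer side: emit barrier, wait for absorption, publish state. */
void checkpointer_change_wal_state(Cluster *c, WalState requested)
{
    for (int i = 0; i < NUM_BACKENDS; i++)
        c->absorbed[i] = false;         /* "emit" the global barrier */

    for (int i = 0; i < NUM_BACKENDS; i++)
        if (c->alive[i])
            backend_absorb_barrier(c, i, requested);

    /* Only after every backend has absorbed the barrier: */
    c->shared_state = requested;
    c->control_file_state = requested;  /* so the state survives a restart */
}

/* Simplified stand-in for XLogInsertAllowed(). */
bool xlog_insert_allowed(const Cluster *c)
{
    return c->shared_state == WAL_PERMITTED;
}
```

Note that the model says nothing about what the checkpointer itself may still
write during the transition, which is exactly what I am asking about.
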

> If there are open transactions that have acquired an XID, the sessions are killed
> before the barrier is absorbed.
>

What about prepared transactions?

> They can't commit without writing WAL, and they
> can't abort without writing WAL, either, so we must at least abort the transaction. We
> don't necessarily need to kill the session, but it's hard to avoid in all cases because
> (1) if there are subtransactions active, we need to force the top-level abort record to
> be written immediately, but we can't really do that while keeping the subtransactions
> on the transaction stack, and (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.
>
> Open transactions that don't have an XID are not killed, but will get an ERROR if they
> try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
>

What if VACUUM is run on an unlogged relation? Do we allow writes via
VACUUM to an unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section that will write
> WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
> AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
> that inserting WAL here must be OK, either because we have an XID (we would have
> been killed by a change to read-only if one had occurred) or for some other reason.
>
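
A minimal sketch of how I understand this coding rule. The lowercase
functions below are hypothetical stand-ins for CheckWALPermitted() and
AssertWALPermitted_HaveXID(); their signatures are my assumption, and where
the real check would presumably raise an ERROR, the stand-in just returns
false. The point of doing the check first is that an ERROR raised inside a
critical section is promoted to PANIC.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for shared and backend-local state (hypothetical). */
bool wal_prohibited = false;
bool have_xid = false;
int  crit_sections_entered = 0;
int  wal_records_written = 0;

/* Stand-in for CheckWALPermitted(): false where the real one would ERROR. */
bool check_wal_permitted(void)
{
    return !wal_prohibited;
}

/* Stand-in for AssertWALPermitted_HaveXID(): holding an XID implies the
 * system is read-write, since the session would otherwise have been killed. */
void assert_wal_permitted_have_xid(void)
{
    assert(have_xid);
    assert(!wal_prohibited);
}

/* Stand-in for START_CRIT_SECTION(); XLogInsert(...); END_CRIT_SECTION(). */
void write_wal_in_crit_section(void)
{
    crit_sections_entered++;
    wal_records_written++;
}

/* Caller without an XID (e.g. VACUUM): check before the critical section. */
bool modify_page_without_xid(void)
{
    if (!check_wal_permitted())
        return false;           /* bail out before entering the section */
    write_wal_in_crit_section();
    return true;
}

/* Caller that is known to hold an XID: an assertion is enough. */
void modify_page_with_xid(void)
{
    assert_wal_permitted_have_xid();
    write_wal_in_crit_section();
}
```

With wal_prohibited set, modify_page_without_xid() fails cleanly and the
critical section is never entered.
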
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major transition, so we emit
> a log message for each successful execution of an ALTER SYSTEM READ {ONLY | WRITE}
> command. Also, we have added a new GUC, system_is_read_only, which returns "on"
> when the system is in the WAL prohibited state or in recovery.
>
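
In other words, I take the GUC to behave roughly like the following. This is
only a sketch of the stated rule; the function name and its two boolean
parameters are hypothetical and do not reflect how the patch wires up the
GUC.

```c
#include <stdbool.h>

/* Value reported for system_is_read_only, per the description above. */
const char *system_is_read_only_value(bool wal_prohibited, bool in_recovery)
{
    return (wal_prohibited || in_recovery) ? "on" : "off";
}
```
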
> Another part of the patch that is quite uneasy and needs discussion is that when the
> system is shut down in the read-only state, we skip the shutdown checkpoint. At restart,
> startup recovery is performed first, and later the read-only state is restored to
> prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded.
> The concern here is that if this startup recovery checkpoint wasn't OK, it will never
> happen, even if the system is later put back into read-write mode.
>

I don't understand this problem. What do you mean by the recovery
checkpoint succeeding or not? Do you add a try..catch and skip any
error while performing the recovery checkpoint?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
