Re: Online checksums patch - once again

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2020-09-09 14:23:01
Message-ID: 76117EC5-A4B8-43DC-B4A2-EF6AAF70F149@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 2 Sep 2020, at 14:22, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:

> The main synchronization mechanisms are the use of the inprogress mode where
> data checksums are written but not verified, and by waiting for all
> pre-existing non-compatible processes (transactions, temp tables) to disappear
> before enabling.
>
> That being handwavily said, I've started to write down a matrix with classes of
> possible synchronization bugs and how the patch handles them in order to
> properly respond.

Having spent some more time on this, I believe I have a better answer (and
patch version) to give.

First, a thank you for asking insightful questions. While working through the
cases I realized that the previous version has a problematic window: when
disabling checksums, if backend A absorbs the data checksums "off" barrier (or
starts after the controlfile has been changed) and writes a page without a
checksum, while backend B has yet to absorb the barrier and is still in "on"
and reads that very same page. The solution IMO is to introduce an inprogress
state for disabling as well where all backends keep writing checksums but not
validating them until no backend is in "on" state anymore. Once all backends
are in "inprogress-off", they can stop writing checksums and transition to
"off". I even had this state in a previous unsubmitted version, embarrassingly
so to the point of the function prototype being there as a leftover in v19.

Now, synchronization happens on two levels in this patch for both the enable
and disable case : (i) between backends when transitioning between states and
(ii) inside the worker when synchronizing the current backends.

For (i) it's using "inprogress" states to ensure a safe transition to "on" or
"off". The states are themselves transitioned to via procsignalbarriers. Both
enabling and disabling follow the same logic, with the only difference being
the order in which operations are switched on/off during inprogress. For (ii)
the workers are waiting for incompatible concurrent processing to end, such as
temporary tables etc (only affects enabling data checksums).

I've tried to write down the synchronization steps in datachecksumsworker.c to
document the code, and for ease of discussion I've pasted that part of the diff
below as well:

* Synchronization and Correctness
* -------------------------------
* The processes involved in enabling, or disabling, data checksums in an
* online cluster must be properly synchronized with the normal backends
* serving concurrent queries to ensure correctness. Correctness is defined
* as the following:
*
* - Backends SHALL NOT violate local datachecksum state
* - Data checksums SHALL NOT be considered enabled cluster-wide until all
* currently connected backends have the local state "enabled"
*
* There are two levels of synchronization required for enabling data checksums
* in an online cluster: (i) changing state in the active backends ("on",
* "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
* incompatible objects and processes are left in a database when workers end.
* The former deals with cluster-wide agreement on data checksum state and the
* latter with ensuring that any concurrent activity cannot break the data
* checksum contract during processing.
*
* Synchronizing the state change is done with procsignal barriers, where the
* backend updating the global state in the controlfile will wait for all other
* backends to absorb the barrier before WAL logging. Barrier absorption will
* happen during interrupt processing, which means that connected backends will
* change state at different times.
*
* When Enabling Data Checksums
* ----------------------------
* A process which fails to observe data checksums being enabled can induce
* two types of errors: failing to write the checksum when modifying the page
* and failing to validate the data checksum on the page when reading it.
*
* When the DataChecksumsWorker has finished writing checksums on all pages
* and enable data checksums cluster-wide, there are three sets of backends:
*
* Bg: Backend updating the global state and emitting the procsignalbarrier
* Bd: Backends on "off" state
* Be: Backends in "on" state
* Bi: Backends in "inprogress-on" state
*
* Backends transition from the Bd state to Be like so: Bd -> Bi -> Be
*
* Backends in Bi and Be will write checksums when modifying a page, but only
* backends in Be will verify the checksum during reading. The Bg backend is
* blocked waiting for all backends in Bi to process interrupts and move to
* Be. Any backend starting will observe the global state being "on" and will
* thus automatically belong to Be. Checksums are enabled cluster-wide when
* Bi is an empty set. All sets are compatible while still operating based on
* their local state.
*
* When Disabling Data Checksums
* -----------------------------
* A process which fails to observe data checksums being disabled can induce
* two types of errors: writing the checksum when modifying the page and
* validating a data checksum which is no longer correct due to modifications
* to the page.
*
* Bg: Backend updating the global state and emitting the procsignalbarrier
* Bd: Backands in "off" state
* Be: Backends in "on" state
* Bi: Backends in "inprogress-off" state
*
* Backends transition from the Be state to Bd like so: Be -> Bi -> Bd
*
* The goal is to transition all backends to Bd making the others empty sets.
* Backends in Bi writes data checksums, but don't validate them, such that
* backends still in Be can continue to validate pages until the barrier has
* been absorbed such that they are in Bi. Once all backends are in Bi, the
* barrier to transition to "off" can be raised and all backends can safely
* stop writing data checksums as no backend is enforcing data checksum
* validation.

I hope this clarifies the reasoning behind the implementation.

This has been implemented in the attached v21 patch. "inprogress" was in need
of a new name before, and with "inprogress-{on|off}" the need is even bigger.
Suggestions for better names are highly appreciated, I'm drawing blanks here.

There are some minor fixes and documentation touchups in this version as well,
but not the SIGINT handling since I wanted to focus on one thing at a time.

cheers ./daniel

Attachment Content-Type Size
online_checksums21.patch application/octet-stream 109.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2020-09-09 14:24:10 Re: Online checksums patch - once again
Previous Message Jobin Augustine 2020-09-09 14:17:22 Re: unsupportable composite type partition keys