Re: Online enabling of checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>
Subject: Re: Online enabling of checksums
Date: 2018-02-22 08:14:00
Message-ID: CABUevExbt+fSHJHtZnTHOf5EiqQa+qDnNtiOdOq8pucyLT1K1A@mail.gmail.com
Lists: pgsql-hackers

Re-sending this one with proper formatting. Apologies for the horrible
gmail-screws-up-the-text-part of the last one!

No change to patch or text, just the formatting.

//Magnus

Once more, here is an attempt to solve the problem of on-line enabling of
checksums that Daniel and I have been hacking on for a bit. See for
example
https://www.postgresql.org/message-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
and
https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com#FF393672-5608-46D6-9224-6620EC532693@endpoint.com
for some previous discussions.

Base design:

Change the checksum flag from a plain on/off setting to an enum with three
states: off/inprogress/on. When checksums are off or on, they work like
today. When checksums are in progress, checksums are *written* but not
verified. State can go from “off” to “inprogress”, from “inprogress” to
either “on” or “off”, or from “on” to “off”.
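
As a rough illustration of the state rules above, here is a minimal
standalone sketch in C. All type and function names are made up for this
example and are not taken from the patch; it only encodes the transitions
and the written/verified distinction described in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum ChecksumState
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS,
    CHECKSUMS_ON
} ChecksumState;

/* Allowed: off -> inprogress, inprogress -> on or off, on -> off. */
static bool
checksum_transition_allowed(ChecksumState from, ChecksumState to)
{
    switch (from)
    {
        case CHECKSUMS_OFF:
            return to == CHECKSUMS_INPROGRESS;
        case CHECKSUMS_INPROGRESS:
            return to == CHECKSUMS_ON || to == CHECKSUMS_OFF;
        case CHECKSUMS_ON:
            return to == CHECKSUMS_OFF;
    }
    return false;
}

/* Checksums are written in both "inprogress" and "on". */
static bool
checksums_written(ChecksumState s)
{
    return s != CHECKSUMS_OFF;
}

/* Checksums are verified only once the state is "on". */
static bool
checksums_verified(ChecksumState s)
{
    return s == CHECKSUMS_ON;
}

/*
 * Apply a transition in place. Returns false and leaves *cur unchanged
 * if the transition is not one of the allowed ones.
 */
static bool
checksum_set_state(ChecksumState *cur, ChecksumState next)
{
    assert(cur != NULL);
    if (!checksum_transition_allowed(*cur, next))
        return false;
    *cur = next;
    return true;
}
```

Note that in this sketch there is no direct path from “off” to “on”: the
only way to reach “on” is through “inprogress”.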

Two new functions are added, pg_enable_data_checksums() and
pg_disable_data_checksums(). The disable one is easy -- it just changes the
state to “off”. The enable one will change the state to “inprogress”, and
then start a background worker (the “checksumhelper launcher”). This
launcher in turn will start one sub-worker (“checksumhelper worker”) in each
database (currently all done sequentially). Each sub-worker will enumerate
all tables/indexes/etc in its database and validate their checksums. If
there is no checksum, or the checksum is incorrect, it will compute a new
checksum and write it out. When all databases have been processed, the
checksum state changes to “on” and the launcher shuts down. At this point,
the cluster has checksums enabled as if it had been initdb’d with checksums
turned on.
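
To make the per-page step concrete, here is a toy sketch in C of a single
pass over an in-memory array of pages. It is not the patch code: the Page
struct, the function names and the checksum function are all invented for
the example, and the checksum is a simple placeholder hash rather than
PostgreSQL's real page checksum. It assumes an 8 kB page, the PostgreSQL
default block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed: default PostgreSQL block size */

/* Hypothetical, simplified page: a stored checksum plus payload. */
typedef struct Page
{
    uint16_t    checksum;
    uint8_t     data[SKETCH_PAGE_SIZE];
} Page;

/* Placeholder hash (FNV-1a folded to 16 bits), NOT PostgreSQL's algorithm. */
static uint16_t
toy_checksum(const Page *p)
{
    uint32_t    h = 2166136261u;

    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
    {
        h ^= p->data[i];
        h *= 16777619u;
    }
    return (uint16_t) ((h >> 16) ^ (h & 0xFFFFu));
}

/*
 * One worker pass: validate every page, and recompute and store the
 * checksum wherever the stored value does not match. Returns the number
 * of pages that had to be rewritten.
 */
static size_t
checksum_pass(Page *pages, size_t npages)
{
    size_t      rewritten = 0;

    assert(pages != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++)
    {
        uint16_t    expected = toy_checksum(&pages[i]);

        if (pages[i].checksum != expected)
        {
            pages[i].checksum = expected;
            rewritten++;
        }
    }
    return rewritten;
}
```

In this sketch a second pass over the same pages rewrites nothing, which is
the property that makes it safe to simply restart the worker after an
interruption.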

If the cluster shuts down while “inprogress”, the DBA will have to manually
either restart the worker (by calling pg_enable_data_checksums()) or turn
checksums off again. Staying “inprogress” carries only a cost and no
benefit, since checksums are written but never verified.

The change of the checksum state is WAL logged with a new xlog record. All
buffers written by the background worker are forced to be logged as full
page writes, to make sure the checksum is fully updated on the standby even
if no actual contents of the buffer changed.

We’ve also included a small commandline tool, bin/pg_verify_checksums, that
can be run against an offline cluster to validate all checksums. Future
improvements include being able to use the background worker/launcher to
perform an online check as well. Being able to run more parallel workers in
the checksumhelper might also be of interest.

The patch includes two sets of tests. The first is an isolation test that
turns on checksums while one session is writing to the cluster and another
is continuously reading, to simulate turning on checksums in a production
database. The second is a TAP test which enables checksums with streaming
replication turned on, to test the new xlog record. The isolation test ran
into the 1024 character limit of the isolation test lexer, for which there
is a separate patch and discussion at
https://www.postgresql.org/message-id/8D628BE4-6606-4FF6-A3FF-8B2B0E9B43D0@yesql.se

Attachment Content-Type Size
online_checksums.patch text/x-patch 68.1 KB
