From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Double-writes, take two?
Date: 2018-04-18 06:22:40
Message-ID: 20180418062240.GJ18178@paquier.xyz

Hi all,

Back in 2012, Dan Scales, who was working on VMware Postgres, posted a
patch aimed at removing the need for full-page writes by introducing
the concept of double writes, using a double-write buffer approach to
fix torn-page problems:
https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com

A patch was published back on that thread, and it has roughly the
following characteristics:
- Double writes happen when a dirty buffer is evicted.
- Two double-write buffers are used, one for the checkpointer and one for
other processes.
- LWLocks are used, along with a bunch of memcpy() calls, to maintain
the double-write batches in a consistent state, and that's heavy.
- The double-write buffers use a pre-decided number of pages (32 for
the checkpointer, 128 divided into 4 buckets for the backends), which
are synced to disk once each batch is full.
- The checkpointer's double-write file orders pages by file and block
number to minimize the number of syncs to happen, using a custom
sequential I/O algorithm.
- That last point is aimed at improving performance. Processes wanting
to write a page to the double-write file actually push pages to the
buffer first, which also forces processes doing some smgrread()
activity or such to look at the double-write buffer.
- A custom page-level checksum is used, to make sure that pages in the
double-write file are not torn. Checksums are normally optional, and
page-level checksums were not yet implemented in Postgres at the time.
- The implementation relies heavily on LWLocks, which kind of sucks
for concurrency.
- If one looks at the patch, the number of fsyncs done is actually
pretty high, and the patch uses an approach close to what WAL does...
More on that below.
- In order to identify each block in the double-write file, a 4k
header is used to store each page's metadata, limiting the number of
pages which can be stored in a single double-write file.
- There is a performance hit in smgrread() and smgrsync(), as double
writes could be on their way to the DW file, so it is necessary to
look at the active batches and see if a wanted page is still there.
- IO_DIRECT is used for the double-write files, which is normally not
mandatory. Peter G has actually reminded me that the fork of Postgres
which VMware had was using IO_DIRECT, but this was dropped when the
switch to pure upstream happened. There is also a trace of that matter
on the mailing lists:
https://www.postgresql.org/message-id/529EEC1C.2040207@vmware.com
- At recovery, files are replayed and truncated. There is one file per
batch of pages written in a dedicated folder. If a page's checksum is
inconsistent in the double-write file, then the page is discarded. If
the page is consistent but the original page in the data file is not,
then the block from the double-write file is copied back into place
(see the sketch after this list).
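
To make that recovery logic concrete, here is a minimal sketch of the
repair decision in C. All names are invented for this mail
(dw_page_checksum() stands in for the patch's custom checksum), and
error handling as well as the 4k batch header are left out:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192

/* Toy stand-in for the patch's custom page-level checksum. */
static uint32_t
dw_page_checksum(const char *page)
{
    uint32_t sum = 0;

    for (int i = 0; i < BLCKSZ; i++)
        sum = (sum << 1) ^ (uint8_t) page[i];
    return sum;
}

/*
 * Decide what to do with one page found in a double-write file.
 * Returns true if the data file's block has been overwritten with
 * the double-write copy.
 */
static bool
dw_recover_page(const char *dw_page, uint32_t dw_sum,
                char *data_page, uint32_t data_sum)
{
    /* Torn copy inside the double-write file itself: discard it. */
    if (dw_page_checksum(dw_page) != dw_sum)
        return false;

    /* DW copy is good; if the data file's page is torn, repair it. */
    if (dw_page_checksum(data_page) != data_sum)
    {
        memcpy(data_page, dw_page, BLCKSZ);
        return true;
    }

    return false;    /* both copies are consistent, nothing to do */
}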

I have spent some time studying the patch, and I am now pretty sure
that the proposed approach has a lot of downsides and still performs
rather badly in cases where there is a large number of dirty page
evictions. OLTP loads would be less prone to that, but analytics
workloads would take a hit, like large scans with aggregates. One case
which I imagine would be rather bad is a post-checkpoint SELECT where
hint bits need to be set.

We already have wal_log_hints, which has a similar performance impact,
but from my reading of the code and of the proposed approach, the way
the double writes are handled is way less than optimal, and we already
have battle-proven facilities that can be reused.

One database system which is known for tackling torn page problems using
double writes is InnoDB, a storage engine for MySQL/MariaDB. In this
case, the main portion of the code is here:
storage/innobase/buf/buf0dblwr.c
storage/innobase/include/buf0dblwr.h
And here are the docs:
https://mariadb.com/kb/en/library/xtradbinnodb-doublewrite-buffer/
The approach used by those folks is a single-file one, whose
concurrency is controlled by a set of mutex locks, roughly along the
lines of the sketch below.
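
Roughly, and this is an invented sketch rather than actual InnoDB
code, the ordering which makes a single-file approach safe looks like
this (error handling omitted, all names made up):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct DirtyPage
{
    int     datafd;    /* data file this page belongs to */
    off_t   offset;    /* byte offset of the page in that file */
    char    page[BLCKSZ];
} DirtyPage;

static void
dw_flush_batch(int dwfd, DirtyPage *batch, int nbatch)
{
    /* Step 1: write the whole batch sequentially to the DW area. */
    for (int i = 0; i < nbatch; i++)
        (void) pwrite(dwfd, batch[i].page, BLCKSZ, (off_t) i * BLCKSZ);

    /*
     * Step 2: a single fsync makes the batch durable before any
     * in-place write gets a chance to tear a data file page.
     */
    (void) fsync(dwfd);

    /*
     * Step 3: write the pages to their real locations; a crash from
     * here on is recoverable from the DW area.
     */
    for (int i = 0; i < nbatch; i++)
        (void) pwrite(batch[i].datafd, batch[i].page, BLCKSZ,
                      batch[i].offset);
}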

I was thinking about this problem, and it looks like one approach for
double writes would be to introduce them as a secondary WAL stream
independent from the main one:
- Once a dirty buffer is evicted from shared buffers, write it to the
double-write stream and to the data file, but only sync it to the
double-write stream (see the sketch after this list).
- The low-level WAL APIs need some refactoring, as the basic idea
would be to (ideally?) allow initialization of a wanted WAL facility
using an API layer similar to what has been introduced for SLRUs,
which is used by many facilities in the backend code.
- Compression of evicted pages can be supported the same way as we do
now for full-page writes using wal_compression.
- At recovery, replay the WAL stream for double-writes first.
Truncation and/or recycling of those files happens in a way similar to
the normal WAL stream and is controlled by checkpoints.
- At checkpoint, truncate the double-write files which are not needed
anymore, as the corresponding data files' contents have been sync'ed.
- Backups are a problem. A first, clean approach to making sure that
backups are consistent is to keep enforcing full-page writes while a
backup is taken, which is what currently happens internally in
Postgres, and then to resume double writes once the backup is done.
Rewind is a second problem, as a rewind would need proper tracking of
the blocks modified since the last checkpoint where WAL has forked, so
the operation would be unsupported at first. Actually, that is not
completely set in stone either; it seems to me that it could be
possible to support both operations with a double-write WAL stream by
making sure that the stream is consistent with what is taken for
backups.
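
To illustrate the eviction path of the first point, here is a sketch
in C. XLogDWInsert(), XLogDWFlush() and DataFileWrite() are
hypothetical stand-ins invented for this mail; only the ordering of
the operations matters:

#include <stdint.h>

#define BLCKSZ 8192

typedef uint64_t XLogRecPtr;    /* as in the backend */

/* Hypothetical API of the secondary double-write WAL stream. */
extern XLogRecPtr XLogDWInsert(const char *page);   /* append page image */
extern void XLogDWFlush(XLogRecPtr upto);           /* fsync that stream */

/* Hypothetical stand-in writing a block in place, without any fsync. */
extern void DataFileWrite(int datafd, uint64_t blkno, const char *page);

/*
 * Evict one dirty buffer: the page image is made durable in the
 * double-write stream first, then written, but not synced, in place.
 * The data file itself is only fsync'ed at checkpoint time, after
 * which the matching double-write segments can be truncated or
 * recycled.
 */
static void
EvictDirtyBuffer(int datafd, uint64_t blkno, const char *page)
{
    XLogRecPtr  ptr = XLogDWInsert(page);

    XLogDWFlush(ptr);    /* only sync the double-write stream */
    DataFileWrite(datafd, blkno, page);
}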

I understand that this set of ideas is sort of crazy, but I wanted to
brainstorm a bit on the -hackers list. I have had these ideas in mind
for some time now, as there are many loads, particularly OLTP-like
ones, where full-page writes are a large portion of the WAL stream
traffic.

(I am still participating in the war effort to stabilize and test v11 of
course, don't worry about that.)

Thanks,
--
Michael
