Re: [Patch] Checksums for SLRU files

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Ivan Kartyshov <i(dot)kartyshov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Checksums for SLRU files
Date: 2018-08-02 03:36:16
Message-ID: CAEepm=3QXnV=bJHxK86iKnC600==Uf7qZpEL6E9yvfZEL8yusg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 2, 2018 at 1:20 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> On 2018-Aug-02, Thomas Munro wrote:
>> PostgreSQL only requires atomic writes of 512 bytes (see
>> PG_CONTROL_MAX_SAFE_SIZE), the traditional sector size for disks made
>> approximately 1980-2010, though as far as I know spinning disks made
>> this decade use 4KB sectors, and for SSDs there is more variation. I
>> suppose the theory for torn SLRU page safety today is that the
>> existing SLRU users all have fully independent values that don't cross
>> sector boundaries, so torn writes can't corrupt them.
>
> Hmm, I wonder if this is true for multixact/members. I think it's not
> true for either 4kB sectors or 512-byte sectors.

Hmm, right, the set of members can span sectors. Let me try that
again. You can cross sector boundaries, but only if you don't require
any kind of multi-sector consistency during replay.

I think the important property for correct operation without FPWs is
that you can't read data from the page itself in order to redo writes
to the page. That rules out whole-page checksum verification, and
probably requires "physical" addressing. By physical addressing I
mean that the WAL record that writes member data must know exactly
where to put it on the page without consulting, for example, the page
header or item pointers that let data move around within the page
("logical" intra-page addressing). We make the page consistent
incrementally, because each WAL record that writes new members into a
page is concerned with a specific physical part of the page identified
by offset and doesn't care about the rest, and no one should ever try
to read any part of it that hasn't already been made consistent. This
seems OK.
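
To make the argument concrete, here is a toy simulation (not PostgreSQL
code). It assumes 512-byte atomic sectors, an 8192-byte page, and WAL
records reduced to hypothetical (offset, data) pairs. The helper names
redo, tear and all_torn_states_recover are invented for illustration.
Because redo never reads the page, replaying the records over any
mixture of old and new sectors should converge on the same final page.

```python
from itertools import combinations

SECTOR_SIZE = 512   # assumed atomic write unit
PAGE_SIZE = 8192    # assumed page size

def redo(page, records):
    """Apply (offset, data) records in order. Never reads the page."""
    page = bytearray(page)
    for offset, data in records:
        page[offset:offset + len(data)] = data
    return page

def tear(old, new, sectors_from_new):
    """Simulate a torn write: only the listed sectors reached disk."""
    page = bytearray(old)
    for s in sectors_from_new:
        lo = s * SECTOR_SIZE
        page[lo:lo + SECTOR_SIZE] = new[lo:lo + SECTOR_SIZE]
    return page

def all_torn_states_recover(old, records):
    """Check every old/new mixture of the touched sectors."""
    new = redo(old, records)
    touched = sorted({
        s
        for off, data in records
        for s in range(off // SECTOR_SIZE,
                       (off + len(data) - 1) // SECTOR_SIZE + 1)
    })
    for k in range(len(touched) + 1):
        for subset in combinations(touched, k):
            if redo(tear(old, new, subset), records) != new:
                return False
    return True

# Page as of the last checkpoint, plus two writes that each straddle a
# sector boundary (bytes 508..515 and 1020..1027).
old = bytes(PAGE_SIZE)
records = [(508, b"\x01" * 8), (1020, b"\x02" * 8)]
```

In this model all_torn_states_recover(old, records) holds for every
combination of torn sectors, even though a torn page on its own (for
example tear(old, redo(old, records), [0])) is not a valid final state.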

Another way to say it is that FPWs are physical logging of whole pages
(they say how to set every single bit), and WAL for multixacts is a
bit like physical logging of smaller regions of the page. Physical
logging doesn't suffer from torn pages, as long as readers are also
looking stuff up by physical addresses and never trying to read areas
of the page that haven't been written to yet. If you want page-level
checksums, though, the incremental approach won't work.
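
A toy sketch of why that is, under loud assumptions: the checksum is
placed in the first 4 bytes of the page (a hypothetical layout), and
zlib.crc32 stands in for whatever checksum algorithm would really be
used. A torn write can leave the new checksum in one sector next to old
data in another, so verification fails before any incremental redo
could run, while a full page image simply overwrites everything.

```python
import zlib

SECTOR_SIZE = 512   # assumed atomic write unit
PAGE_SIZE = 8192    # assumed page size
CSUM_LEN = 4        # hypothetical: checksum lives in bytes 0..3

def set_checksum(page):
    """Return a copy of the page with its checksum field filled in."""
    page = bytearray(page)
    csum = zlib.crc32(bytes(page[CSUM_LEN:]))
    page[:CSUM_LEN] = csum.to_bytes(CSUM_LEN, "little")
    return page

def checksum_ok(page):
    """Verify the stored checksum against the rest of the page."""
    stored = int.from_bytes(page[:CSUM_LEN], "little")
    return stored == zlib.crc32(bytes(page[CSUM_LEN:]))

# Old page on disk, and the new version with 4 bytes written in sector 1.
old = set_checksum(bytes(PAGE_SIZE))
new = bytearray(old)
new[600:604] = b"\xAA\xBB\xCC\xDD"
new = set_checksum(new)

# Torn write: sector 0 (with the new checksum) reached disk, sector 1 did not.
torn = bytearray(old)
torn[:SECTOR_SIZE] = new[:SECTOR_SIZE]

# A full page image sets every byte, so it does not care what was on disk.
restored = bytearray(bytes(new))
```

Here checksum_ok(old) and checksum_ok(new) both hold, checksum_ok(torn)
does not, and checksum_ok(restored) holds again after the full-page
overwrite.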

--
Thomas Munro
http://www.enterprisedb.com
