Re: Setting BLCKSZ 4kB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, sanyam jain <sanyamjain22(at)live(dot)in>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Setting BLCKSZ 4kB
Date: 2018-01-27 04:01:08
Message-ID: 20180127040108.GA30459@momjian.us
Lists: pgsql-hackers

On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>
>
> On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> > Yes, that is the hard part, making sure you have 4k granularity of
> > write, and matching write alignment. pg_test_fsync and diskchecker.pl
> > (which we mention in our docs) will not help here. A specific
> > alignment test based on diskchecker.pl would have to be written.
> > However, if you look at the kernel code you might be able to verify
> > quickly that the 4k atomicity is not guaranteed.
> >
>
> Are you suggesting there's a part of the kernel code clearly showing
> it's not atomic? Can you point us to that part of the kernel sources?

Well, my point is that you would either need to repeatedly test that the
file system writes to durable storage in 4k chunks, or check the file
system source code to see that it does. I don't know how to check the
file system source code myself. The other issue is that it has to write
those 4k chunks using the same alignment as the file itself.
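
As a rough illustration of the kind of repeated test meant above, here is
a minimal sketch of a torn-block check. It is not diskchecker.pl and not
an existing tool; the names make_block and torn_blocks and the 8-byte
stamp layout are assumptions made up for this example. The idea is to
fill every 4k block with a single repeated generation stamp before each
rewrite, then, after a forced crash, read the file back and flag any
block that contains more than one stamp.

```python
import struct

BLOCK = 4096  # candidate BLCKSZ under test (assumption for this sketch)


def make_block(generation, block_size=BLOCK):
    """Build one block filled with a repeated 8-byte generation stamp."""
    return struct.pack("<Q", generation) * (block_size // 8)


def torn_blocks(data, block_size=BLOCK):
    """Return the indexes of blocks that mix more than one stamp.

    A block holding two different stamps was only partially overwritten,
    i.e. the write was not atomic at block_size granularity.
    """
    torn = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stamps = {block[i:i + 8] for i in range(0, len(block), 8)}
        if len(stamps) > 1:
            torn.append(start // block_size)
    return torn
```

A clean result from such a check proves little on its own; only an
observed torn block is conclusive, which is why the test would have to be
repeated many times across real power-loss events.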

> FWIW even if it's not safe in general, it would be useful to understand
> what are the requirements to make it work. I mean, conditions that need
> to be met on various levels (sector size of the storage device, page
> size of the file system, filesystem alignment, ...).

I think you are fine as soon as the data arrives at the durable storage,
and assuming the data can't be partially written to durable storage. I
was thinking more of a case where you have a file system, a RAID card
without a BBU, and then magnetic disks. In that case, even if the file
system were to write in 4k chunks, the RAID controller would also need
to do the same, and with the same alignment. Of course, that's probably
a silly example since there is probably no way to atomically write 4k to
a magnetic disk.
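
The layering argument above can be written down as a simple check. This
is only a sketch under simplifying assumptions: each layer is described
by a hypothetical (name, atomic write unit, start offset) tuple, and a
layer is treated as safe only if its atomic unit is a whole multiple of
the block size and its start offset is block-aligned. Real devices rarely
document their atomic write unit, so the inputs here are illustrative,
not something that can be read from the hardware.

```python
def misaligned_layers(block_size, layers):
    """Return the names of layers that could split a block_size write.

    layers is a list of (name, atomic_unit_bytes, start_offset_bytes)
    tuples, ordered from the file system down to the physical medium.
    All values are caller-supplied assumptions, not probed from hardware.
    """
    problems = []
    for name, unit, offset in layers:
        unit_ok = unit >= block_size and unit % block_size == 0
        offset_ok = offset % block_size == 0
        if not (unit_ok and offset_ok):
            problems.append(name)
    return problems


# Example stack: file system, RAID card without a BBU, magnetic disk.
stack = [
    ("filesystem", 4096, 0),
    ("raid", 4096, 512),   # right unit size, wrong alignment
    ("disk", 512, 0),      # 512-byte sectors cannot hold 4k atomically
]
print(misaligned_layers(4096, stack))
```

With these made-up numbers both the RAID layer and the disk are flagged,
which matches the point that every layer, not just the file system, has
to write 4k with the same alignment.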

Actually, what happens if a 4k write is being written to an SSD and the
server crashes? Is the entire write discarded?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
