Re: Setting BLCKSZ 4kB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, sanyam jain <sanyamjain22(at)live(dot)in>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Setting BLCKSZ 4kB
Date: 2018-01-27 11:40:03
Message-ID: 39f9fcb4-33e9-52bd-0c44-aa1b5d2fcd21@2ndquadrant.com
Lists: pgsql-hackers

On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
>
> I think you are fine as soon as the data arrives at the durable
> storage, and assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example since there is
> probably no way to atomically write 4k to a magnetic disk.
>
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes? Is the entire write discarded?
>

AFAIK it's not possible to end up with a partial write, particularly not
one that contains a mix of old and new data - that's because SSDs
can't overwrite a block without erasing it first.

So the write should either succeed or fail as a whole, depending on when
exactly the server crashes - it might be right before the flush is
confirmed back to the client, for example. That assumes the drive has 4kB
sectors (internal pages), and it describes drives with a volatile write
cache that support write barriers and cache flushes. On drives with a
non-volatile write cache (so with battery/capacitor) the write should
always succeed and never get discarded.
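
To make the conditions from upthread (device sector size, filesystem
block size, alignment) a bit more concrete, here is a rough sketch of how
one might check them on Linux. It is only an illustration: the function
names and the example device/partition names are made up, the checks are
necessary rather than sufficient, and the sector size the kernel reports
is not guaranteed to match the internal page size of the flash.

```python
import os


def torn_write_risks(blcksz, fs_block, physical_sector, part_start_bytes):
    """List reasons why a blcksz-sized page write might not be atomic.

    Hypothetical helper: an empty list means none of the three checks
    below fired, not that the storage stack is proven safe.
    """
    risks = []
    # The page should map to exactly one physical sector.
    if blcksz != physical_sector:
        risks.append("page size differs from physical sector size")
    # Filesystem blocks should hold whole pages, so a page never
    # straddles two filesystem blocks.
    if fs_block % blcksz != 0:
        risks.append("filesystem block size is not a multiple of page size")
    # The partition must start on a physical sector boundary.
    if part_start_bytes % physical_sector != 0:
        risks.append("partition start is not sector-aligned")
    return risks


def read_sysfs_int(path):
    """Read a single integer from a sysfs attribute file (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def gather(data_dir, disk, part):
    """Collect the inputs on Linux; disk/part are e.g. 'sda' and 'sda1'."""
    fs_block = os.statvfs(data_dir).f_bsize
    physical = read_sysfs_int(f"/sys/block/{disk}/queue/physical_block_size")
    # The partition start is reported in 512-byte units.
    start = read_sysfs_int(f"/sys/block/{disk}/{part}/start") * 512
    return fs_block, physical, start
```

For example, torn_write_risks(4096, *gather(data_dir, "sda", "sda1"))
would report which of the three conditions fail for a 4kB BLCKSZ. It says
nothing about RAID controllers or write caches sitting in between.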

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
