Re: Setting BLCKSZ 4kB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: sanyam jain <sanyamjain22(at)live(dot)in>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Setting BLCKSZ 4kB
Date: 2018-01-26 22:53:33
Message-ID: c969f17a-19d2-ed2a-66a1-7f5116081d45@2ndquadrant.com
Lists: pgsql-hackers

On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
>>
>> Hello,
>>
>>> What are the cons of setting BLCKSZ as 4kB? When saw the results published
>>> on [...].
>>
>> There were other posts and publications which points to the same direction
>> consistently.
>>
>> This matches my deep belief is that postgres default block size is a
>> reasonable compromise for HDD, but is less pertinent for SSD for most OLTP
>> loads.
>>
>> For OLAP, I do not think it would lose much, but I have not tested it.
>>
>>> Does turning off FPWs will be safe if BLCKSZ is set to 4kB given page size
>>> of file system is 4kB?
>>
>> FPW = Full Page Write. I would not bet on turning off FPW, ISTM
>> that SSDs can have "page" sizes as low as 512 bytes, but are
>> typically 2 kB or 4 kB, and the information easily available
>> anyway.
>

Is this referring to sector size or the internal SSD page size?

AFAIK there are only 512B and 4096B sectors, so I assume you must be
talking about the latter. I don't think I've ever heard about an SSD
with 512B pages though (generally the page sizes are 2kB to 16kB).

But more importantly, I don't see why the size of the internal page
would matter here at all? SSDs have a non-volatile write cache (DRAM with
battery), protecting all the internal writes to pages. If your SSD does
not do that correctly, it's already broken no matter what page size it
uses even with full_page_writes=on.

On spinning rust the caches would be disabled and replaced by a write
cache on a RAID controller with a battery, but that's not possible on SSDs
where the on-disk cache is baked into the whole design.

What I think does matter here is the sector size (i.e. either 512B or
4096B) used to communicate with the disk. Obviously, if the kernel
writes a 4kB page as a series of independent 512B writes, that would be
unreliable. If it sends one 4kB write, why wouldn't that work?
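
To make the distinction concrete, here is a toy model in Python. It is
only a sketch: it assumes each write unit reaches the disk atomically,
which is exactly the property in question, and the helper names
(write_with_crash, is_torn) are made up for illustration.

```python
SECTOR_512 = 512
PAGE_4K = 4096


def write_with_crash(disk, offset, data, unit, units_before_crash):
    """Copy `data` onto `disk` in chunks of `unit` bytes.

    Each chunk is assumed to be atomic (an assumption of this model).
    The simulated power loss happens after `units_before_crash` chunks.
    """
    for n, start in enumerate(range(0, len(data), unit)):
        if n >= units_before_crash:
            return  # power loss: the remaining chunks never reach the disk
        disk[offset + start:offset + start + unit] = data[start:start + unit]


def is_torn(disk, offset, old, new):
    """True if the on-disk page is neither the old nor the new version."""
    page = bytes(disk[offset:offset + len(new)])
    return page != old and page != new


old_page = b"A" * PAGE_4K
new_page = b"B" * PAGE_4K

# Eight independent 512B writes, crash after three of them: torn page.
disk = bytearray(old_page)
write_with_crash(disk, 0, new_page, SECTOR_512, 3)
torn_with_512 = is_torn(disk, 0, old_page, new_page)

# One 4kB write: the crash lands either before it or after it.
torn_with_4k = []
for done in (0, 1):
    disk = bytearray(old_page)
    write_with_crash(disk, 0, new_page, PAGE_4K, done)
    torn_with_4k.append(is_torn(disk, 0, old_page, new_page))
```

In this model the 512B case ends up with a mix of old and new data,
while the single 4kB write leaves either the old or the new page.
Whether real hardware behaves like the second case is the open question.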

> Yes, that is the hard part, making sure you have 4k granularity of
> write, and matching write alignment. pg_test_fsync and diskchecker.pl
> (which we mention in our docs) will not help here. A specific
> alignment test based on diskchecker.pl would have to be written.
> However, if you look at the kernel code you might be able to verify
> quickly that the 4k atomicity is not guaranteed.
>

Are you suggesting there's a part of the kernel code clearly showing
it's not atomic? Can you point us to that part of the kernel sources?

FWIW even if it's not safe in general, it would be useful to understand
what the requirements are to make it work. I mean, conditions that need
to be met on various levels (sector size of the storage device, page
size of the file system, filesystem alignment, ...).
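
As a starting point, such a list of conditions could be written down as
a simple check. This is a sketch under my own assumptions, not an
established rule set: the function name and the exact conditions are
hypothetical, and passing all of them is at best necessary, not
sufficient. On Linux the sector sizes can be read from
/sys/block/<dev>/queue/logical_block_size and physical_block_size.

```python
def single_write_problems(blcksz, logical_sector, physical_sector,
                          fs_block, partition_offset):
    """List reasons why a BLCKSZ-sized page might not map to one aligned
    device write. All sizes and offsets are in bytes.

    An empty list only means these particular checks pass; it does not
    prove that the write is atomic.
    """
    problems = []
    if blcksz != physical_sector:
        problems.append("BLCKSZ differs from the physical sector size")
    if logical_sector != physical_sector:
        problems.append("logical and physical sector sizes differ")
    if fs_block % blcksz != 0:
        problems.append("filesystem block is not a multiple of BLCKSZ")
    if partition_offset % physical_sector != 0:
        problems.append("partition start is not sector-aligned")
    return problems


# 4kB everywhere, partition starting at 1MiB: no problems reported.
aligned = single_write_problems(4096, 4096, 4096, 4096, 1048576)

# 8kB pages on a 512e drive with a partition starting at sector 63.
misaligned = single_write_problems(8192, 512, 4096, 4096, 63 * 512)
```

The second call fails all four checks, the first one fails none.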

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
