Re: Large block sizes support in Linux

From: Pankaj Raghav <kernel(at)pankajraghav(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-25 15:06:04
Message-ID: 97e72ccb-193f-43e0-97ac-17359c1c874b@pankajraghav.com
Lists: pgsql-hackers

On 23/03/2024 03:41, Bruce Momjian wrote:
> On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote:
>> Right, but things change over time - current storage devices support
>> much larger sectors (LBA format), usually 4K. And if you do I/O with
>> this size, it's usually atomic.
>>
>> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
>> format, that would not need full-page writes - we always do I/O in 4k
>> pages, and block layer does I/O (during writeback from page cache) with
>> minimum guaranteed size = logical block size. 4K are great for OLTP
>> systems in general, it'd be even better if we didn't need to worry about
>> torn pages (but the tricky part is to be confident it's safe to disable
>> them on a particular system).
>
> Yes, even if the file system is 8k, and the storage is 8k, we only know
> that torn pages are impossible if the file system never overwrites
> existing 8k pages, but writes new ones and then makes it active. I
> think ZFS does that to handle snapshots.
>

I think we can also avoid torn writes if:
- the filesystem's data path always writes in aligned multiples of 8k, and
- the device supports 8k atomic writes.

Then we might be able to push the responsibility to the device without incurring the overhead
of a CoW FS or FPW=on. Of course, the performance here depends on the vendor-specific
implementation of atomic writes.
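
As a rough illustration of the two conditions above, here is a minimal Python sketch.
It is a simplified model, not anything from the XFS or PostgreSQL code: the function
names and the parameters fs_min_write and device_atomic_bytes are hypothetical, and
the caller is assumed to obtain those values by some other means. It also assumes
that a device splitting a larger write does so only on atomic-unit boundaries.

```python
PAGE_SIZE = 8192  # PostgreSQL's default block size, used here as the page size


def is_aligned_multiple(offset: int, length: int, unit: int = PAGE_SIZE) -> bool:
    """Return True if a write starts on a unit boundary and covers whole units."""
    return length > 0 and offset % unit == 0 and length % unit == 0


def torn_write_free(fs_min_write: int, device_atomic_bytes: int,
                    page_size: int = PAGE_SIZE) -> bool:
    """Check both conditions for one database page size (illustrative model).

    fs_min_write        -- hypothetical: smallest aligned unit the filesystem
                           data path ever writes
    device_atomic_bytes -- hypothetical: largest write the device applies
                           atomically
    """
    # Condition 1: the filesystem only writes aligned multiples of the page size.
    fs_ok = fs_min_write >= page_size and fs_min_write % page_size == 0
    # Condition 2: the device can apply at least one whole page atomically.
    device_ok = device_atomic_bytes >= page_size
    return fs_ok and device_ok


if __name__ == "__main__":
    print(is_aligned_multiple(8192, 16384))  # aligned, two whole pages
    print(torn_write_free(4096, 8192))       # 4k filesystem writes can tear an 8k page
```

If either condition fails, the model reports that torn pages remain possible, which
corresponds to keeping FPW=on or relying on a CoW filesystem.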

We are trying to enable the first condition by adding large block size (LBS) support to XFS in Linux.

--
Pankaj
