Re: Large block sizes support in Linux

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: "Pankaj Raghav (Samsung)" <kernel(at)pankajraghav(dot)com>, pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-23 02:41:41
Message-ID: Zf5BZVA4UhbSlLa4@momjian.us
Lists: pgsql-hackers

On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote:
> Right, but things change over time - current storage devices support
> much larger sectors (LBA format), usually 4K. And if you do I/O with
> this size, it's usually atomic.
>
> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
> format, that would not need full-page writes - we always do I/O in 4K
> pages, and the block layer does I/O (during writeback from page cache) with
> a minimum guaranteed size = logical block size. 4K pages are great for OLTP
> systems in general, and it'd be even better if we didn't need to worry about
> torn pages (but the tricky part is to be confident it's safe to disable
> full-page writes on a particular system).

Yes. Even if the file system block size is 8kB and the storage block size
is 8kB, we only know that torn pages are impossible if the file system
never overwrites existing 8kB pages in place, but instead writes new ones
and then makes them active. I think ZFS does that to handle snapshots.
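To make the in-place versus copy-on-write distinction concrete, here is a toy model. It assumes a page is written as 512-byte sectors that can each land or not independently, and that the final switch to the new copy is atomic. The function names are hypothetical, and this is not how ZFS is actually implemented.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 8192
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE  # 16 sectors per 8kB page


def overwrite_in_place(on_disk: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Overwrite the live page sector by sector; power fails after `sectors_written` sectors."""
    result = bytearray(on_disk)
    for i in range(sectors_written):
        lo, hi = i * SECTOR_SIZE, (i + 1) * SECTOR_SIZE
        result[lo:hi] = new_page[lo:hi]
    return bytes(result)


def copy_on_write(on_disk: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Write the new version to a fresh location, then switch to it.

    Assumes the switch (e.g. updating one block pointer) is atomic, so a crash
    leaves either the old copy or the new copy active, never a mix.
    """
    fresh = overwrite_in_place(bytes(PAGE_SIZE), new_page, sectors_written)
    if sectors_written == SECTORS_PER_PAGE:
        return fresh   # every sector landed, so the switch happened
    return on_disk     # crash before the switch: the old copy is still active


old_page = b"A" * PAGE_SIZE
new_page = b"B" * PAGE_SIZE
torn = overwrite_in_place(old_page, new_page, 5)   # crash after 5 of 16 sectors
survivor = copy_on_write(old_page, new_page, 5)    # same crash, copy-on-write
```

The in-place write leaves a page that is neither the old nor the new version, while the copy-on-write path always leaves one complete version.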

> The other thing is - is there a reliable way to say when the guarantees
> actually apply? I mean, how would the administrator *know* it's safe to
> set full_page_writes=off, or even better how could we verify this when
> the database starts (and complain if it's not safe to disable FPW)?
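On Linux, the kernel reports a block device's logical block size in sysfs at /sys/block/<dev>/queue/logical_block_size. The sketch below reads it and checks the device-level condition described above, assuming a whole-disk device name (nvme0n1 in the test is hypothetical) and page writes aligned to that size. The helper names are made up for illustration, and a True result is necessary but not sufficient: it says nothing about whether the file system, RAID layer, or drive firmware actually preserves that atomicity.

```python
from pathlib import Path


def logical_block_size(device: str, sysfs_root: str = "/sys") -> int:
    """Read the logical block size the kernel reports for a whole-disk device."""
    path = Path(sysfs_root) / "block" / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


def page_fits_in_one_block(device: str, page_size: int = 8192, sysfs_root: str = "/sys") -> bool:
    """Device-level precondition only: an aligned page fits inside one logical block.

    Does NOT prove that disabling full_page_writes is safe.
    """
    return page_size <= logical_block_size(device, sysfs_root)
```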

Yes, this is quite hard to know. Our docs have:

https://www.postgresql.org/docs/current/wal-reliability.html

Another risk of data loss is posed by the disk platter write operations
themselves. Disk platters are divided into sectors, commonly 512 bytes
each. Every physical read or write operation processes a whole sector.
When a write request arrives at the drive, it might be for some multiple
of 512 bytes (PostgreSQL typically writes 8192 bytes, or 16 sectors, at
a time), and the process of writing could fail due to power loss at any
time, meaning some of the 512-byte sectors were written while others
were not. To guard against such failures, PostgreSQL periodically writes
full page images to permanent WAL storage before modifying the actual
page on disk. By doing this, during crash recovery PostgreSQL can
--> restore partially-written pages from WAL. If you have file-system
--> software that prevents partial page writes (e.g., ZFS), you can turn off
--> this page imaging by turning off the full_page_writes parameter.
--> Battery-Backed Unit (BBU) disk controllers do not prevent partial page
--> writes unless they guarantee that data is written to the BBU as full
--> (8kB) pages.
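As a toy illustration of the recovery step the docs describe (not PostgreSQL's actual WAL record format or redo code), the model below logs a full page image for the first change to a page after a checkpoint and small in-page changes afterwards. WalRecord and replay are hypothetical names, and the model simplifies by treating the logged image as already containing the change it records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PAGE_SIZE = 8192  # PostgreSQL's default block size


@dataclass
class WalRecord:
    page_no: int
    full_image: Optional[bytes] = None  # set on the first change to a page after a checkpoint
    offset: int = 0                     # otherwise the record carries a small in-page change
    data: bytes = b""


def replay(disk: Dict[int, bytearray], wal: List[WalRecord]) -> None:
    """Redo, in order, the WAL records written since the last checkpoint."""
    for rec in wal:
        page = disk[rec.page_no]
        if rec.full_image is not None:
            # Replace the whole page, so whatever torn mix is on disk no longer matters.
            page[:] = rec.full_image
        else:
            page[rec.offset:rec.offset + len(rec.data)] = rec.data


# Page 0 held only "A"s at the checkpoint. One change then rewrote it to "B"s
# (logged with a full image) and a later change wrote "C"s at offset 200.
# A crash tore the page write: first half new, second half old.
final = bytearray(b"B" * PAGE_SIZE)
final[200:210] = b"C" * 10
disk = {0: bytearray(final[:4096] + b"A" * (PAGE_SIZE - 4096))}
wal = [
    WalRecord(0, full_image=b"B" * PAGE_SIZE),
    WalRecord(0, offset=200, data=b"C" * 10),
]
replay(disk, wal)
```

Replaying only the second record onto the torn page would leave the old "A"s in its second half, which is why the image is logged for the first change after each checkpoint.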

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.
