Re: Direct I/O

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Dagfinn Ilmari Mannsåker <ilmari(at)ilmari(dot)org>, Christoph Berg <myon(at)debian(dot)org>, mikael(dot)kjellstrom(at)gmail(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Direct I/O
Date: 2023-04-19 17:10:13
Lists: pgsql-hackers


On 2023-04-18 09:44:10 +1200, Thomas Munro wrote:
> * We have no plans to turn this on by default even when the later
> asynchronous machinery is proposed, and direct I/O starts to make more
> economic sense (think: your stream of small reads and writes will be
> converted to larger preadv/pwritev or moral equivalent and performed
> ahead of time in the background). Reasons: (1) There will always be a
> few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
> learned in this thread; it fails with EINVAL at open() time), and (2)
> without a page cache, you really need to size your shared_buffers
> adequately and we can't do that automatically. It's something you'd
> opt into for a dedicated database server along with other carefully
> considered settings. It seems acceptable to me that if you set
> io_direct to a non-default setting on an unusual-for-a-database-server
> filesystem you might get errors screaming about inability to open
> files -- you'll just have to turn it back off again if it doesn't work
> for you.

FWIW, *long* term I think it might make sense to turn DIO on automatically for a
small subset of operations, if supported. Examples:

1) Once we have the ability to "feed" walsenders from wal_buffers, instead of
going to disk, automatically using DIO for WAL might be beneficial. The
increase in IO concurrency and reduction in latency one can get is

2) If we make base backups use s_b if pages are in s_b, and do locking via s_b
for non-existing pages, it might be worth automatically using DIO for the
reads of the non-resident data, to avoid swamping the kernel page cache
with data that won't be read again soon (and to utilize DMA etc).

3) When writing back dirty data that we don't expect to be dirtied again soon,
e.g. from vacuum ringbuffers or potentially even checkpoints, it could make
sense to use DIO, to avoid the kernel keeping such pages in the page cache.
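
To make (3) a bit more concrete, here is a minimal sketch of the idea, not
actual PostgreSQL code. It prefers O_DIRECT for a write-back, falls back to
buffered I/O if the file system refuses O_DIRECT with EINVAL at open() time
(as tmpfs was reported to do upthread), and writes from a suitably aligned
buffer. All names (demo_open_prefer_direct, demo_write_block,
DEMO_BLOCK_SIZE, DEMO_IO_ALIGN) and the 8192/4096 values are assumptions made
up for illustration.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_BLOCK_SIZE 8192    /* hypothetical block size for this sketch */
#define DEMO_IO_ALIGN   4096    /* hypothetical, conservative I/O alignment */

/*
 * Open a file for writing, preferring O_DIRECT.  If the file system rejects
 * O_DIRECT with EINVAL, retry with ordinary buffered I/O.  *used_direct
 * reports which mode was obtained.  Returns a file descriptor, or -1.
 */
static int
demo_open_prefer_direct(const char *path, bool *used_direct)
{
    *used_direct = false;
#ifdef O_DIRECT
    {
        int     fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);

        if (fd >= 0)
        {
            *used_direct = true;
            return fd;
        }
        if (errno != EINVAL)
            return -1;          /* some other failure; don't mask it */
    }
#endif
    return open(path, O_WRONLY | O_CREAT, 0600);
}

/*
 * Write one block at the given block number.  The data is copied into a
 * bounce buffer aligned to DEMO_IO_ALIGN, since direct I/O typically
 * requires aligned memory, offsets and lengths.  Returns 0 or -1.
 */
static int
demo_write_block(int fd, const void *block, off_t blockno)
{
    void       *buf;
    ssize_t     n;

    if (posix_memalign(&buf, DEMO_IO_ALIGN, DEMO_BLOCK_SIZE) != 0)
        return -1;
    memcpy(buf, block, DEMO_BLOCK_SIZE);
    n = pwrite(fd, buf, DEMO_BLOCK_SIZE, blockno * DEMO_BLOCK_SIZE);
    free(buf);
    return (n == DEMO_BLOCK_SIZE) ? 0 : -1;
}
```

A real implementation would keep the data in already-aligned buffers rather
than copying through a bounce buffer; the copy here only keeps the sketch
short.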

But for the main s_b, I agree, I can't foresee us turning on DIO by
default. Unless somebody has tuned s_b at least some for the workload, that's
not going to go well. And even if somebody has, it's quite reasonable to use
the same host also for other programs (including other PG instances), in which
case it's likely desirable to be adaptive to the current load when deciding
what to cache - which the kernel is in the best position to do.

> If the alignment trick from c.h appears to be available but is actually
> broken (GCC 4.2.1), then those assertions I added into smgrread() et
> al will fail as Tom showed (yay! they did their job), or in a
> non-assert build you'll probably get EINVAL when you try to read or
> write from your badly aligned buffers depending on how picky your OS
> is, but that's just an old bug in a defunct compiler that we have by
> now written more about than they ever did in their bug tracker.

Agreed. If we ever find such issues in a postmordial compiler, we'll just need
to beef up our configure test to detect that it doesn't actually fully support
specifying alignment.
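
For illustration only, a beefed-up test could check at run time that
over-aligned objects really land on aligned addresses, rather than merely
checking that the attribute compiles. The sketch below is a guess at the
shape of such a probe, not the actual configure test. It assumes a
GCC-compatible compiler, and DemoAlignedBlock, demo_is_aligned,
demo_alignment_works and DEMO_IO_ALIGN are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IO_ALIGN 4096      /* hypothetical alignment to verify */

/* Assumes GCC-compatible attribute syntax. */
#define demo_attribute_aligned(a) __attribute__((aligned(a)))

/* A block whose first member requests the I/O alignment. */
typedef union DemoAlignedBlock
{
    demo_attribute_aligned(DEMO_IO_ALIGN) char data[8192];
    double      force_align_d;
} DemoAlignedBlock;

/*
 * Check an address at run time.  The value goes through a volatile so the
 * compiler cannot fold the test to "true" based on the declared alignment.
 */
static bool
demo_is_aligned(const void *p, size_t align)
{
    volatile uintptr_t addr = (uintptr_t) p;

    return (addr % align) == 0;
}

/*
 * Return true if both a static and a stack-allocated block actually end up
 * at DEMO_IO_ALIGN-aligned addresses.
 */
static bool
demo_alignment_works(void)
{
    static DemoAlignedBlock static_block;
    DemoAlignedBlock stack_block;

    stack_block.data[0] = 0;    /* make sure the object is really used */
    static_block.data[0] = 0;

    return demo_is_aligned(&static_block, DEMO_IO_ALIGN) &&
        demo_is_aligned(&stack_block, DEMO_IO_ALIGN);
}
```

A configure probe built along these lines would exit nonzero when
demo_alignment_works() returns false, and the build would then treat the
alignment attribute as unavailable.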


Andres Freund
