Re: Direct I/O

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Andrea Gelmini <andrea(dot)gelmini(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Direct I/O
Date: 2023-04-11 02:31:40
Message-ID: CA+hUKGKLr1G5DFWZWNPvmyj5tGFMRqZj=VnX7PYOqkQbR4B_kQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> And the fix has been merged into
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
>
> I think that means it'll have to wait for 6.4 development to open (in a few
> weeks), and then will be merged into the stable branches from there.

Great! Let's hope/assume for now that that'll fix phenomenon #2.
That still leaves the checksum-vs-concurrent-modification thing that I
called phenomenon #1, which we've not actually hit with PostgreSQL yet
but which is clearly possible and can be seen with the stand-alone
repro program I posted upthread. You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think we really need to think about whether we eventually want to do
> something to avoid modifying pages while IO is in progress. The only
> alternative is for filesystems to make copies of everything in the IO path,
> which is far from free (and obviously prevents from using DMA for the whole
> IO). The copy we do to avoid the same problem when checksums are enabled,
> shows up quite prominently in write-heavy profiles, so there's a "purely
> postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list
at [1]) do when the data changes underneath a write. ZFS's policy is
conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose
route. I can see arguments for both approaches (ZFS can only reach the
zero-copy optimum by turning off checksums completely, while BTRFS is
happy to assume that if you break this programming rule that is not
written down anywhere then you must never want to see your data ever
again). What about ReFS? CephFS?
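
To make the difference between the two policies concrete, here is a toy
simulation. It is only a sketch and not how either file system is
implemented: ToyChecksummingStore, set_hint_bit, demo and the 8192-byte
block size are all hypothetical names and values made up for
illustration, and CRC-32 merely stands in for whatever checksum a real
file system uses. The callback simulates another process touching the
buffer in the window between checksum computation and data transfer.

```python
import zlib

BLOCK_SIZE = 8192  # assumed block size, for illustration only


class ToyChecksummingStore:
    """Hypothetical stand-in for a checksumming file system (not a real API)."""

    def __init__(self, copy_first):
        self.copy_first = copy_first
        self.data = b""
        self.checksum = 0

    def write(self, buf, concurrent_modification=None):
        # Conservative policy: snapshot the caller's buffer before anything else.
        # Zero-copy policy: keep reading straight from the caller's buffer.
        source = bytes(buf) if self.copy_first else buf
        self.checksum = zlib.crc32(source)
        # Window between checksum computation and data transfer: another
        # thread or process may modify the caller's buffer here.
        if concurrent_modification is not None:
            concurrent_modification(buf)
        # The "DMA" step: whatever the source holds now is what gets stored.
        self.data = bytes(source)

    def verify(self):
        # A later read recomputes the checksum over the stored data.
        return zlib.crc32(self.data) == self.checksum


def set_hint_bit(buf):
    # Simulates a hint bit being set while the write is in progress.
    buf[100] |= 0x01


def demo(copy_first):
    page = bytearray(BLOCK_SIZE)
    store = ToyChecksummingStore(copy_first)
    store.write(page, concurrent_modification=set_hint_bit)
    return store.verify()
```

With this toy model, demo(True) returns True because the stored block
and its checksum both come from the same snapshot, while demo(False)
returns False because the checksum describes the old contents and the
stored data contains the new bit. The price of the first policy is the
extra copy on every write.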

I tried to find out what POSIX says about this WRT synchronous
pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't
allow), but couldn't find it in my first attempt. It *does* say it's
undefined for aio_write() (which means that my prototype
io_method=posix_aio code that uses that stuff is undefined in the
presence of hint bit modifications). I don't really see why it should
vary between synchronous and asynchronous interfaces (considering the
existence of threads, shared memory etc., the synchronous interface
only removes one thread from the list of possible suspects that could
flip some bits).

But yeah, in any case, it doesn't seem great that we do that.

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2] https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs
