Re: O_DIRECT for relations and SLRUs (Prototype)

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: O_DIRECT for relations and SLRUs (Prototype)
Date: 2019-01-13 09:02:16
Message-ID: 20190113090216.GB6220@paquier.xyz
Lists: pgsql-hackers

On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> 1. We need a new "bgreader" process to do read-ahead. I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming). In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Yes, that could be interesting, mainly for analytics, as we could
target future reads more precisely than the OS readahead does.

> 2. We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Okay, that's a new idea. A bgwriter able to do syncs in chunks would
also be interesting with O_DIRECT, no?
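
To make the "big sequential chunks" part concrete, here is a minimal
sketch of how a writer could turn an unordered list of dirty block
numbers into contiguous ranges. BlockRange and coalesce_dirty_blocks()
are hypothetical names for illustration, not existing PostgreSQL code,
and the sketch ignores relation/fork boundaries and locking entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type for illustration; not a PostgreSQL structure. */
typedef struct BlockRange
{
    uint32_t start;     /* first block number of the run */
    uint32_t nblocks;   /* number of consecutive blocks in the run */
} BlockRange;

/* qsort comparator for block numbers, ascending. */
static int
cmp_block(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a;
    uint32_t y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/*
 * Sort the dirty block numbers in place, skip duplicates, and merge
 * consecutive blocks into ranges of at most max_run blocks each.
 * "out" must have room for n entries.  Returns the number of ranges.
 */
size_t
coalesce_dirty_blocks(uint32_t *blocks, size_t n, uint32_t max_run,
                      BlockRange *out)
{
    size_t nranges = 0;

    if (n == 0 || max_run == 0)
        return 0;

    qsort(blocks, n, sizeof(uint32_t), cmp_block);

    out[0].start = blocks[0];
    out[0].nblocks = 1;

    for (size_t i = 1; i < n; i++)
    {
        BlockRange *cur = &out[nranges];

        if (blocks[i] == blocks[i - 1])
            continue;           /* duplicate entry, already covered */

        if (blocks[i] == cur->start + cur->nblocks &&
            cur->nblocks < max_run)
            cur->nblocks++;     /* extends the current run */
        else
        {
            /* gap found or run is full: start a new range */
            nranges++;
            out[nranges].start = blocks[i];
            out[nranges].nblocks = 1;
        }
    }

    return nranges + 1;
}
```

Each resulting range could then be issued as one large write rather
than one write per block. The max_run cap is an assumption here, meant
to bound the size of a single I/O.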

> 3. We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see
any reference to it.

> Whether we need multiple bgreader and bgwriter processes or perhaps a
> general IO scheduler process may depend on whether we also want to
> switch to async (multiplexing from a single process). Starting simple
> with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step?
Or did I misunderstand you? :)

One of the reasons why I began this thread is that, since we heard
about the fsync issues on Linux, I think that there is room for giving
our user base more control of their fate, without relying on decisions
of the Linux community that can potentially eat data and corrupt a
cluster when a page's dirty bit is cleared without its data actually
being flushed. Even the latest kernels do not fix all the patterns
with open fds across processes, which only moves the problem from one
corner of the table to another. There are also folks patching the
Linux kernel to make Postgres more reliable from this perspective, and
living happily with this option. As long as the option can be
controlled and defaults to false, it seems to me that we could do
something. Even if the performance is bad, this gives users control
of how they want things to be done.
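
As a rough illustration of what "an option that defaults to false"
could look like at the file-open level, here is a minimal sketch.
relation_open_flags(), alloc_io_buffer() and IO_ALIGN are hypothetical
names, not taken from the prototype patch, and the 4096-byte alignment
is an assumption: O_DIRECT generally requires the buffer, offset and
length to be aligned, but the exact requirement depends on the
filesystem and device.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif

#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define IO_ALIGN 4096           /* assumed alignment, see note above */

/*
 * Build the open(2) flags for a relation file.  Direct I/O is only
 * requested when the (hypothetical) setting is enabled, so passing
 * false keeps today's buffered behavior.
 */
int
relation_open_flags(bool use_direct_io)
{
    int flags = O_RDWR;

#ifdef O_DIRECT
    if (use_direct_io)
        flags |= O_DIRECT;
#else
    (void) use_direct_io;       /* platform without O_DIRECT */
#endif

    return flags;
}

/*
 * Allocate an I/O buffer aligned to IO_ALIGN.  The size is rounded up
 * to a multiple of the alignment, as C11 aligned_alloc() expects.
 * Returns NULL on failure; release the buffer with free().
 */
void *
alloc_io_buffer(size_t size)
{
    size_t rounded = (size + IO_ALIGN - 1) / IO_ALIGN * IO_ALIGN;

    return aligned_alloc(IO_ALIGN, rounded);
}
```

A caller would pass relation_open_flags(setting) to open() and do its
reads and writes through buffers obtained from alloc_io_buffer().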
--
Michael
