Re: What is the posix_memalign() equivalent for the PostgreSQL?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anderson Carniel <accarniel(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: What is the posix_memalign() equivalent for the PostgreSQL?
Date: 2016-09-14 21:34:10
Message-ID: CA+TgmoZEcG3u7DzTpQtzYUpqRnvb3cKdW3G+ZbdM+9Lq=JeQTQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 2, 2016 at 1:17 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2016-09-02 13:05:37 -0400, Tom Lane wrote:
>> Anderson Carniel <accarniel(at)gmail(dot)com> writes:
>> > If not, according to your experience, is there a
>> > significant difference in performance with O_DIRECT or without it?
>>
>> AFAIK, nobody's really bothered to measure whether that would be useful
>> for Postgres. The results would probably be quite platform-specific
>> anyway.
>
> I've played with patches to make postgres use O_DIRECT. On linux, it's
> rather beneficial for some workloads (fits into memory), but it also
> works really badly for some others, because our IO code isn't
> intelligent enough. We pretty much rely on write() being nearly
> instantaneous when done by normal backends (during buffer replacement),
> we rely on readahead, we rely on the kernel to stopgap some bad
> replacement decisions we're making.

So, suppose we changed the world so that backends don't write dirty
buffers, or at least not normally. If they need to perform a buffer
eviction, they first check the freelist, then run the clock sweep.
The clock sweep puts clean buffers on the freelist and dirty buffers
on a to-be-cleaned list. A background process writes buffers on the
to-be-cleaned list and then adds them to the freelist afterward if the
usage count hasn't been bumped meanwhile. As in Amit's bgreclaimer
patch, we have a target size for the freelist, with a low watermark
and a high watermark. When we drop below the low watermark, the
background processes run the clock sweep and write from the
to-be-cleaned list to try to populate it; when we surge above the high
watermark, they go back to sleep.
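To make the moving parts concrete, here is a minimal single-process sketch of that scheme. Everything in it is hypothetical: the names (Pool, bgreclaimer_run, backend_get_buffer), the tiny pool size, and the watermark values are made up for illustration, and there is no locking, pinning, or real I/O. It is not how bufmgr.c or Amit's patch is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative sizes only; real values would be tunable. */
enum { NBUFFERS = 8, LOW_WATERMARK = 2, HIGH_WATERMARK = 4 };

typedef struct { int usage_count; bool dirty; bool listed; } Buf;
typedef struct { int items[NBUFFERS]; int count; } BufList;

typedef struct {
    Buf     bufs[NBUFFERS];
    BufList freelist;   /* clean buffers ready for reuse */
    BufList to_clean;   /* dirty buffers awaiting a background write */
    int     hand;       /* clock sweep position */
    int     writes;     /* number of simulated background writes */
} Pool;

void pool_init(Pool *p) { memset(p, 0, sizeof(*p)); }

static void list_push(Pool *p, BufList *l, int id)
{
    assert(l->count < NBUFFERS);    /* a buffer is on at most one list */
    l->items[l->count++] = id;
    p->bufs[id].listed = true;
}

static int list_pop(Pool *p, BufList *l)
{
    int id;

    if (l->count == 0)
        return -1;
    id = l->items[--l->count];
    p->bufs[id].listed = false;
    return id;
}

/* One tick of the clock: clean victims go to the freelist, dirty ones
 * to the to-be-cleaned list; recently used buffers get a second chance. */
static void clock_sweep_step(Pool *p)
{
    int  id = p->hand;
    Buf *b = &p->bufs[id];

    p->hand = (p->hand + 1) % NBUFFERS;
    if (b->listed)
        return;
    if (b->usage_count > 0) {
        b->usage_count--;
        return;
    }
    list_push(p, b->dirty ? &p->to_clean : &p->freelist, id);
}

/* "Write" one dirty buffer, then free it unless it was touched meanwhile. */
static void clean_one(Pool *p)
{
    int id = list_pop(p, &p->to_clean);

    if (id < 0)
        return;
    p->bufs[id].dirty = false;      /* stand-in for the actual write */
    p->writes++;
    if (p->bufs[id].usage_count == 0)
        list_push(p, &p->freelist, id);
}

/* Background process: wake below the low watermark, refill to the high one. */
void bgreclaimer_run(Pool *p)
{
    int budget = NBUFFERS * 8;      /* bound the work done per wakeup */

    if (p->freelist.count >= LOW_WATERMARK)
        return;                     /* enough free buffers; stay asleep */
    while (p->freelist.count < HIGH_WATERMARK && budget-- > 0) {
        if (p->to_clean.count > 0)
            clean_one(p);
        else
            clock_sweep_step(p);
    }
}

/* Backend: check the freelist first, then fall back to the clock sweep.
 * Returns -1 if no clean victim turns up within the budget. */
int backend_get_buffer(Pool *p)
{
    int budget = NBUFFERS * 8;

    for (;;) {
        int id = list_pop(p, &p->freelist);

        if (id >= 0) {
            if (p->bufs[id].usage_count == 0)
                return id;
            continue;               /* touched since it was listed; skip */
        }
        if (budget-- <= 0)
            return -1;
        clock_sweep_step(p);
    }
}
```

The point of the sketch is that backend_get_buffer never calls clean_one: a backend that finds only dirty victims queues them and keeps sweeping, so the write cost lands on the background process.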

Further, suppose we also create a prefetch system, maybe based on the
synchronized scan machinery. It preemptively pulls data into
shared_buffers if an ongoing scan will need it soon. Or maybe don't
base it on the synchronized scan machinery, but instead just have a
queue that lets backends throw prefetch requests over the wall; when
the queue wraps, old requests are discarded. A background process -
or perhaps one per tablespace or something like that - pulls the data
in.
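The queue variant could look roughly like the sketch below: a fixed-size ring where a new request overwrites the oldest one once the ring is full. The type and function names (PrefetchQueue, prefetch_enqueue, prefetch_dequeue) and the queue size are invented for illustration, and the sketch is single-threaded; a real version in shared memory would need a lock or atomics around head and tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PREFETCH_QUEUE_SIZE = 4 };   /* illustrative; real size would be tunable */

typedef struct { uint32_t relation; uint32_t block; } PrefetchRequest;

typedef struct {
    PrefetchRequest slots[PREFETCH_QUEUE_SIZE];
    uint64_t head;  /* total number of requests ever enqueued */
    uint64_t tail;  /* index of the next request the worker will take */
} PrefetchQueue;

/* Called by backends. Never blocks: if the ring is full, the oldest
 * pending request is overwritten and thereby discarded. */
void prefetch_enqueue(PrefetchQueue *q, uint32_t relation, uint32_t block)
{
    assert(q != NULL);
    q->slots[q->head % PREFETCH_QUEUE_SIZE] = (PrefetchRequest){relation, block};
    q->head++;
    if (q->head - q->tail > PREFETCH_QUEUE_SIZE)
        q->tail = q->head - PREFETCH_QUEUE_SIZE;    /* drop what was overwritten */
}

/* Called by the background prefetcher. Returns false when the queue is empty. */
bool prefetch_dequeue(PrefetchQueue *q, PrefetchRequest *out)
{
    assert(q != NULL && out != NULL);
    if (q->tail == q->head)
        return false;
    *out = q->slots[q->tail % PREFETCH_QUEUE_SIZE];
    q->tail++;
    return true;
}
```

Dropping old requests is acceptable here because a prefetch is only a hint: if the worker falls behind, the scan simply reads the block itself.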

Neither of those things seems that hard. And if we could do those
things and make them work, then maybe we could offer direct I/O as an
option. We'd still lose heavily in the case where our buffer eviction
decisions are poor, but that'd probably spur some improvement in that
area, which IMHO would be a good thing.

I personally think direct I/O would be a really good thing, not least
because O_ATOMIC is designed to allow MySQL to avoid the doublewrite
buffer, their alternative to full page writes. But we can't use it
because it requires O_DIRECT. The savings are probably massive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
