Re: [PATCH] Prefetch index pages for B-Tree index scans

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: John Lumby <johnlumby(at)hotmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, cedric(at)2ndquadrant(dot)com
Subject: Re: [PATCH] Prefetch index pages for B-Tree index scans
Date: 2012-11-02 05:05:02
Message-ID: CAGTBQpZCtb9L-WYLD-wn-_eXsMAer3dALukkcZZEmg2tsYDQgA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 11/1/12 6:13 PM, Claudio Freire wrote:
>
>> posix_fadvise what's the trouble there, but the fact that the kernel
>> stops doing read-ahead when a call to posix_fadvise comes. I noticed
>> the performance hit, and checked the kernel's code. It effectively
>> changes the prediction mode from sequential to fadvise, negating the
>> (assumed) kernel's prefetch logic.
>
>
...
>
> The Linux posix_fadvise implementation never seemed like it was well liked
> by the kernel developers. Quirky stuff like this popped up all the time
> during that period, when effective_io_concurrency was being added. I wonder
> how far back the fadvise/read-ahead conflict goes.

Well, to be precise, it's not so much a problem in posix_fadvise
itself as a problem in how it interacts with readahead. Since
readahead works at the memory mapper level, and only when actually
performing I/O (which would seem at first glance quite sensible), it
doesn't get to see fadvise activity.

FADV_WILLNEED is implemented as a forced readahead, which doesn't
update any of the readahead context structures. Again, at first
glance, this would seem sensible (explicit hints shouldn't interfere
with pattern detection logic). However, since those pages are (after
the fadvise call) under async I/O, the next time the memory mapper
needs one of them, it will wait for that async I/O to complete instead
of requesting I/O through the readahead logic.

IOW, what was in fact sequential became invisible to readahead,
indistinguishable from random I/O. Whatever page fadvise failed to
predict will be treated as random I/O, and that is where the trouble
lies.
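
A minimal sketch of the access pattern described above, in Python for
brevity. It issues POSIX_FADV_WILLNEED for some blocks and then reads the
whole file sequentially, so any block missing from the hint list is the
kind of page that "fadvise failed to predict". The 8 kB block size and
the helper names (read_with_hints, make_test_file) are assumptions made
for illustration; the code only shows the call pattern and does not by
itself measure or reproduce the readahead slowdown.

```python
import os
import tempfile

BLOCK = 8192  # assumed block size (PostgreSQL's default page size)


def read_with_hints(path, hinted_blocks, total_blocks):
    """Read total_blocks sequentially, hinting only hinted_blocks beforehand.

    Returns the number of bytes read.
    """
    have_fadvise = hasattr(os, "posix_fadvise")  # Unix-only call
    fd = os.open(path, os.O_RDONLY)
    try:
        if have_fadvise:
            for blk in hinted_blocks:
                # Explicit hint: ask the kernel to start fetching this block.
                os.posix_fadvise(fd, blk * BLOCK, BLOCK,
                                 os.POSIX_FADV_WILLNEED)
        nbytes = 0
        for blk in range(total_blocks):
            # Sequential reads. Blocks that were not hinted are the ones
            # the discussion above says end up looking like random I/O.
            nbytes += len(os.pread(fd, BLOCK, blk * BLOCK))
        return nbytes
    finally:
        os.close(fd)


def make_test_file(total_blocks):
    """Hypothetical helper: create a scratch file of total_blocks blocks."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(BLOCK * total_blocks))
    return f.name
```

Timing read_with_hints() with a partial hint list against an empty one,
on a cold cache and a file much larger than this, would be one way to
look for the effect on a given kernel.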

>> I've mused about the possibility to batch async_io requests, and use
>> the scatter/gather API instead of sending tons of requests to the
>> kernel. I think doing so would enable a zero-copy path that could very
>> possibly imply big speed improvements when memory bandwidth is the
>> bottleneck.
>
> Another possibly useful bit of history here for you. Greg Stark wrote a
> test program that used async I/O effectively on both Linux and Solaris.
> Unfortunately, it was hard to get that to work given how Postgres does its
> buffer I/O, and using processes instead of threads. This looks like the
> place he commented on why:
>
> http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
>
> The part I think was relevant there from him:
>
> "In the libaio view of the world you initiate io and either get a
> callback or call another syscall to test if it's complete. Either
> approach has problems for Postgres. If the process that initiated io
> is in the middle of a long query it might take a long time, or even
> never get back to complete the io. The callbacks use threads...
>
> And polling for completion has the problem that another process could
> be waiting on the io and can't issue a read as long as the first
> process has the buffer locked and io in progress. I think aio makes a
> lot more sense if you're using threads so you can start a thread to
> wait for the io to complete."

I noticed that. I always envisioned async I/O as managed by some
dedicated process. One that could check for completion or receive
callbacks. Postmaster, for instance.
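
As a rough illustration of the "dedicated process" idea, the sketch below
has one worker process own all the I/O and report completions over a
queue, so the requesting process never has to poll. Everything here is
hypothetical: the names io_worker and prefetch_via_worker, the queue
protocol, and the use of blocking os.pread() in place of real async I/O
are assumptions, and nothing in it reflects how PostgreSQL or the
postmaster actually work. It also assumes a Unix system with the "fork"
start method.

```python
import multiprocessing as mp
import os

# Assumption: Unix-only "fork" start method, to keep the sketch short.
_ctx = mp.get_context("fork")


def io_worker(requests, completions):
    """Hypothetical dedicated I/O process: do the reads, report completion."""
    while True:
        req = requests.get()
        if req is None:  # sentinel: no more requests
            break
        path, offset, length = req
        fd = os.open(path, os.O_RDONLY)
        try:
            # Blocking read stands in for async I/O; it warms the page cache.
            nread = len(os.pread(fd, length, offset))
        finally:
            os.close(fd)
        completions.put((path, offset, nread))


def prefetch_via_worker(path, offsets, length):
    """Send read requests to the worker and collect completion notices."""
    requests, completions = _ctx.Queue(), _ctx.Queue()
    worker = _ctx.Process(target=io_worker, args=(requests, completions))
    worker.start()
    for off in offsets:
        requests.put((path, off, length))
    requests.put(None)
    # One worker and FIFO queues, so completions arrive in request order.
    done = [completions.get() for _ in offsets]
    worker.join()
    return done
```

The point of the design is only that completion handling lives in one
place; a backend busy with a long query is not responsible for finishing
the I/O it started.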
