Re: Proposal of PITR performance improvement for 8.4.

From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-30 00:58:44
Message-ID: a778a7260810291758v76a048c6g9c83d0676de2d040@mail.gmail.com
Lists: pgsql-hackers

Hi,

2008/10/29 Simon Riggs <simon(at)2ndquadrant(dot)com>:
>
> On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:
>
>> 1. You should avoid useless posix_fadvise() calls. In the naive
>> implementation, where you simply call posix_fadvise() for every page
>> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
>> per WAL record, and that's a lot of overhead. We face the same design
>> question as with Greg's patch to use posix_fadvise() to prefetch index
>> and bitmap scans: what should the interface to the buffer manager look
>> like? The simplest approach would be a new function call like
>> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
>> page if it's not in the buffer cache, but is a no-op otherwise. But that
>> means more overhead, since for every page access, we need to find the
>> page twice in the buffer cache; once for the AdviseBuffer() call, and
>> 2nd time for the actual ReadBuffer().
>
> That's a much smaller overhead than waiting for an I/O. The CPU overhead
> isn't really a problem if we're I/O bound.

As discussed last year in the threads on parallel recovery and the
random read problem, recovery is really I/O bound, especially when
full-page writes (FPW) are not available. And it is not practical to
require all archive logs to include huge FPWs.

>
>> It would be more efficient to pin
>> the buffer in the AdviseBuffer() call already, but that requires much
>> more changes to the callers.
>
> That would be hard to cleanup safely, plus we'd have difficulty with
> timing: is there enough buffer space to allow all the prefetched blocks
> live in cache at once? If not, this approach would cause problems.

I'm not positive about the AdviseBuffer() idea. If we do this, we need
all the pages referred to from a WAL segment in the shared buffers.
This may be several GB and will compete with the kernel cache. Current
PostgreSQL relies heavily on the kernel cache (and the kernel I/O
scheduler), and it is not a good idea to have large shared buffers. The
worst case is to give half of the physical memory to the shared
buffers; the performance will be very bad. Rather, I prefer to ask the
kernel to prefetch.
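
To illustrate what "asking the kernel to prefetch" could look like, here
is a minimal sketch, not the code from the proposal. It assumes an 8 kB
block size and an already opened data file descriptor; the names
TOY_BLCKSZ, block_offset() and prefetch_block() are made up for this
example. The only real API used is posix_fadvise() with
POSIX_FADV_WILLNEED.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* needed for posix_fadvise() on some systems */
#endif
#include <fcntl.h>
#include <sys/types.h>

/* Assumed block size for this sketch (hypothetical name). */
#define TOY_BLCKSZ 8192

/* Byte offset of a block within its data file. */
off_t
block_offset(unsigned int blkno)
{
    return (off_t) blkno * TOY_BLCKSZ;
}

/*
 * Hint the kernel that one block will be read soon.  Nothing is copied
 * into user space or shared buffers; the page only lands in the kernel
 * cache.  Returns 0 on success or an error number, as posix_fadvise() does.
 */
int
prefetch_block(int fd, unsigned int blkno)
{
    return posix_fadvise(fd, block_offset(blkno), TOY_BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

The point of this design is that the prefetched pages occupy kernel
cache only, so shared buffers do not have to grow to hold a whole WAL
segment's worth of pages.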

>
>> 2. The format of each WAL record is different, so you need a "readahead
>> handler" for every resource manager, for every record type. It would be
>> a lot simpler if there was a standardized way to store that information
>> in the WAL records.
>
> I would prefer a new rmgr API call that returns a list of blocks. That's
> better than trying to make everything fit one pattern. If the call
> doesn't exist then that rmgr won't get prefetch.

Yes, I like this idea. Could you let me try this API as part of a
prefetch implementation in the core (if that is agreed)?
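
Just to make the idea concrete, a rough sketch of such an optional
per-rmgr callback could look like the following. Everything here is
hypothetical (ToyRecord, PrefetchTarget, ToyRmgr, rm_blocks,
record_blocks) and is not the existing rmgr table; it only shows the
"no callback means no prefetch" behavior Simon describes.

```c
#include <stddef.h>

/* Toy stand-in for a WAL record; real records are rmgr-specific. */
typedef struct ToyRecord
{
    unsigned char rmid;     /* index of the resource manager */
    unsigned int  rel;      /* stand-in for a relation file identifier */
    unsigned int  blk;      /* block touched by this record */
} ToyRecord;

typedef struct PrefetchTarget
{
    unsigned int rel;
    unsigned int blk;
} PrefetchTarget;

/* Optional callback: fill 'out' with referenced blocks, return the count. */
typedef size_t (*rm_blocks_fn) (const ToyRecord *rec,
                                PrefetchTarget *out, size_t max);

typedef struct ToyRmgr
{
    const char  *name;
    rm_blocks_fn rm_blocks;     /* NULL: this rmgr gets no prefetch */
} ToyRmgr;

/* Example callback for a heap-like rmgr: one block per record. */
size_t
heap_blocks(const ToyRecord *rec, PrefetchTarget *out, size_t max)
{
    if (max < 1)
        return 0;
    out[0].rel = rec->rel;
    out[0].blk = rec->blk;
    return 1;
}

const ToyRmgr toy_rmgrs[] = {
    {"heap", heap_blocks},
    {"other", NULL}             /* no callback provided */
};

#define TOY_NRMGRS (sizeof(toy_rmgrs) / sizeof(toy_rmgrs[0]))

/* Dispatch to the rmgr's callback; unknown or callback-less rmgrs yield 0. */
size_t
record_blocks(const ToyRecord *rec, PrefetchTarget *out, size_t max)
{
    if (rec->rmid >= TOY_NRMGRS || toy_rmgrs[rec->rmid].rm_blocks == NULL)
        return 0;
    return toy_rmgrs[rec->rmid].rm_blocks(rec, out, max);
}
```

With this shape, each rmgr can decode its own record formats, and
adding prefetch support for one more rmgr means filling in one more
callback.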

>
>> 3. IIRC I tried to handle just a few most important WAL records at
>> first, but it turned out that you really need to handle all WAL records
>> (that are used at all) before you see any benefit. Otherwise, every time
>> you hit a WAL record that you haven't done posix_fadvise() on, the
>> recovery "stalls", and you don't need much of those to diminish the gains.
>>
>> Not sure how these apply to your approach, it's very different. You seem
>> to handle 1. by collecting all the page references for the WAL file, and
>> sorting and removing the duplicates. I wonder how much CPU time is spent
>> on that?
>
> Removing duplicates seems like it will save CPU.

If we invoke posix_fadvise() on blocks that are already in the kernel
cache, the call does nothing useful but still costs some overhead in
the kernel. I think duplicate removal saves more than it costs.
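
As a sketch of the duplicate removal step (assumed names throughout,
not the code from the patch): collect the page references for one WAL
file into an array, sort them, and drop repeats in one pass. BlockRef,
blockref_cmp() and blockref_sort_unique() are hypothetical names for
this example; qsort() is the standard C library call.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical page reference: which relation file, which block. */
typedef struct BlockRef
{
    unsigned int rel;   /* stand-in for a relation file identifier */
    unsigned int blk;   /* block number within that file */
} BlockRef;

/* Order by relation first, then by block number. */
int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->rel != y->rel)
        return x->rel < y->rel ? -1 : 1;
    if (x->blk != y->blk)
        return x->blk < y->blk ? -1 : 1;
    return 0;
}

/* Sort refs in place, drop duplicates, return the number of unique entries. */
size_t
blockref_sort_unique(BlockRef *refs, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    qsort(refs, n, sizeof(BlockRef), blockref_cmp);
    for (size_t i = 1; i < n; i++)
    {
        /* Keep refs[i] only if it differs from the last kept entry. */
        if (blockref_cmp(&refs[out], &refs[i]) != 0)
            refs[++out] = refs[i];
    }
    return out + 1;
}
```

After this step each distinct block gets at most one posix_fadvise()
call, and the calls are issued in file and block order.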

>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>

--
------
Koichi Suzuki
