Re: Proposal of PITR performance improvement for 8.4.

From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 01:14:48
Message-ID: a778a7260810281814g35069ec2pcc40d60cd578e19e@mail.gmail.com
Lists: pgsql-hackers

Heikki,

1. In the code, all the referenced pages are extracted from the WAL
records in scope and sorted, both to schedule the reads and to avoid
duplicate posix_fadvise() calls. If the first WAL record that touches
a page carries a full page write, that page is not prefetched. There
is still some risk of calling posix_fadvise() for pages already in the
kernel cache, but I don't think that overhead is very big. I'd like to
measure how much improvement we can achieve with some benchmark.

2. Yes, we have to analyze each WAL record using rm-specific code, and
I don't think that is good. At present, the major part of this code is
taken from xlogdump. Are you proposing to improve the WAL record format
to be less rm-specific? That's interesting too.

3. Yes, the code handles all the WAL records except for those which
do not need any "read" operation in the recovery, such as records that
create new pages. Improving the WAL record format as in 2. will help
to simplify this as well.
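
To illustrate point 1, here is a minimal sketch of the sort-and-deduplicate
step. It is not the code from the patch: the PageRef struct, its fields, and
both function names are hypothetical, and it assumes a single relation file
with 8192-byte blocks. Only posix_fadvise() and POSIX_FADV_WILLNEED are real
interfaces.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192         /* assumed block size */

/* Hypothetical: one page reference decoded from a WAL record. */
typedef struct PageRef {
    uint32_t block_no;              /* block within the relation file */
    uint32_t seq;                   /* position of the record in the WAL scope */
    bool     has_full_page_image;   /* record carries a full page write */
} PageRef;

/* Order by block number, then by WAL position within the same block. */
static int cmp_page_ref(const void *a, const void *b)
{
    const PageRef *x = a;
    const PageRef *y = b;

    if (x->block_no != y->block_no)
        return x->block_no < y->block_no ? -1 : 1;
    if (x->seq != y->seq)
        return x->seq < y->seq ? -1 : 1;
    return 0;
}

/*
 * Sort refs in place and compact them to one entry per block that still
 * needs a read. A block whose earliest reference carries a full page
 * image is dropped, since redo overwrites it without reading it.
 * Returns the number of entries kept at the front of refs.
 */
size_t schedule_prefetch(PageRef *refs, size_t n)
{
    size_t out = 0;

    assert(refs != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(PageRef), cmp_page_ref);

    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;

        /* Skip the later references to the same block. */
        while (j < n && refs[j].block_no == refs[i].block_no)
            j++;
        if (!refs[i].has_full_page_image)
            refs[out++] = refs[i];
        i = j;
    }
    return out;
}

/* Issue one posix_fadvise() per scheduled block; returns 0 or an errno value. */
int issue_prefetch(int fd, const PageRef *refs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int rc = posix_fadvise(fd, (off_t) refs[i].block_no * BLOCK_SIZE,
                               BLOCK_SIZE, POSIX_FADV_WILLNEED);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

Because the kept entries come out in ascending block order, issue_prefetch()
makes at most one call per block. The CPU cost of the sort is the part that
still needs to be measured.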

2008/10/28 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
> Gregory Stark wrote:
>>
>> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> writes:
>>
>>> This is my first proposal of PITR performance improvement for
>>> PostgreSQL 8.4 development. This proposal includes readahead
>>> mechanism of data pages which will be read by redo() routines in the
>>> recovery. This is especially effective in the recovery without full
>>> page write. Readahead is done by posix_fadvise() as proposed in
>>> index scan improvement.
>>
>> Incidentally: a bit of background for anyone who wasn't around when this
>> last came up: prefetching is especially important for our recovery code
>> because it's single-threaded. If you have a raid array you're effectively
>> limited to using a single drive at a time. This is a major problem because
>> the logs could have been written by many processes hammering the raid
>> array concurrently. In other words your warm standby database might not be
>> able to keep up with the logs from the master database even on identical
>> (or even better) hardware.
>>
>> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
>> Heikki suggested using prefetching.
>
> I actually played around with the prefetching, and even wrote a quick
> prototype of it, about a year ago. It read ahead a fixed number of the WAL
> records in xlog.c, calling posix_fadvise() for all pages that were
> referenced in them. I never got around to finishing it, as I wanted to see
> Greg's posix_fadvise() patch get done first and rely on the same
> infrastructure, but here are some lessons I learned:
>
> 1. You should avoid useless posix_fadvise() calls. In the naive
> implementation, where you simply call posix_fadvise() for every page
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per
> WAL record, and that's a lot of overhead. We face the same design question
> as with Greg's patch to use posix_fadvise() to prefetch index and bitmap
> scans: what should the interface to the buffer manager look like? The
> simplest approach would be a new function call like AdviseBuffer(Relation,
> BlockNumber), that calls posix_fadvise() for the page if it's not in the
> buffer cache, but is a no-op otherwise. But that means more overhead, since
> for every page access, we need to find the page twice in the buffer cache:
> once for the AdviseBuffer() call, and a second time for the actual
> ReadBuffer(). It would be more efficient to pin the buffer in the
> AdviseBuffer() call already, but that requires many more changes to the
> callers.
>
> 2. The format of each WAL record is different, so you need a "readahead
> handler" for every resource manager, for every record type. It would be a
> lot simpler if there was a standardized way to store that information in the
> WAL records.
>
> 3. IIRC I tried to handle just a few most important WAL records at first,
> but it turned out that you really need to handle all WAL records (that are
> used at all) before you see any benefit. Otherwise, every time you hit a WAL
> record that you haven't done posix_fadvise() on, the recovery "stalls", and
> you don't need many of those to diminish the gains.
>
> Not sure how these apply to your approach, it's very different. You seem to
> handle 1. by collecting all the page references for the WAL file, and
> sorting and removing the duplicates. I wonder how much CPU time is spent on
> that?
>
>>> Details of the implementation will be found in README file in the
>>> material.
>>
>> I've read through this and I think I disagree with the idea of using a
>> separate program. It's a lot of extra code -- and duplicated code from the
>> normal recovery too.
>
> Agreed, this belongs in core. The nice thing about a separate process is
> that you could hook it into restore_command, with no changes to the server,
> but as you note in the README, we'd want to use this in crash recovery as
> well, and the interaction between the external command and the startup
> process seems overly complex for that. Besides, we want to use the
> posix_fadvise() stuff in the backend anyway, so we should use the same
> infrastructure during WAL replay as well.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
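
On the AdviseBuffer() idea in Heikki's point 1 above, a rough sketch of the
"advise only if not cached" behaviour might look like the following. This is
not PostgreSQL's buffer manager: ToyBufferCache and both functions are
hypothetical stand-ins, the cache is a plain array searched linearly, and an
8192-byte block size is assumed.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

#define ADVISE_BLOCK_SIZE 8192  /* assumed block size */

/* Hypothetical stand-in for the shared buffer cache: a list of cached blocks. */
typedef struct ToyBufferCache {
    const uint32_t *cached_blocks;
    size_t          ncached;
} ToyBufferCache;

/* First lookup: is the block already in the (toy) buffer cache? */
static bool toy_cache_contains(const ToyBufferCache *cache, uint32_t block_no)
{
    for (size_t i = 0; i < cache->ncached; i++) {
        if (cache->cached_blocks[i] == block_no)
            return true;
    }
    return false;
}

/*
 * No-op when the block is cached; otherwise hint the kernel to read it.
 * Returns 0 on the no-op path or on success, else the errno value from
 * posix_fadvise(). A later read of the same block would repeat the cache
 * lookup, which is the double-lookup overhead mentioned in the quoted text.
 */
int toy_advise_buffer(const ToyBufferCache *cache, int fd, uint32_t block_no)
{
    assert(cache != NULL);

    if (toy_cache_contains(cache, block_no))
        return 0;

    return posix_fadvise(fd, (off_t) block_no * ADVISE_BLOCK_SIZE,
                         ADVISE_BLOCK_SIZE, POSIX_FADV_WILLNEED);
}
```

Pinning the buffer inside the advise call would avoid the second lookup, but
as noted above that changes the contract for every caller.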

--
------
Koichi Suzuki
