Re: [GENERAL] Slow PITR restore

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 12:28:51
Message-ID: 47612583.4090705@enterprisedb.com
Lists: pgsql-general pgsql-hackers

Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
>> We would have readbuffers in shared memory, like wal_buffers in reverse.
>> Each worker would read the next WAL record and check there is no
>> conflict with other concurrent WAL records. If not, it will apply the
>> record immediately, otherwise wait for the conflicting worker to
>> complete.
>
> Well I guess you would have to bring up the locking infrastructure and lock
> any blocks in the record you're applying (sorted first to avoid deadlocks). As
> I understand it we don't use locks during recovery now but I'm not sure if
> that's just because we don't have to or if there are practical problems which
> would have to be solved to do so.

We do use locks during recovery; XLogReadBuffer takes an exclusive lock
on the buffer. According to the comments there, it wouldn't be strictly
necessary, but I believe we do actually need it to protect against the
bgwriter writing out a buffer while it's being modified. We only lock
one page at a time, which is good enough for WAL replay, but not enough
to protect things like a b-tree split from concurrent access.
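For illustration, the replay codepath does roughly this (just a
sketch; the exact XLogReadBuffer signature has varied between
releases, and reln/blkno/lsn here stand in for whatever the record
carries):

    /* Read the target page; XLogReadBuffer returns it with an
     * exclusive content lock already held, which is what keeps
     * the bgwriter from writing the page out mid-modification. */
    Buffer      buffer = XLogReadBuffer(reln, blkno, false);

    if (BufferIsValid(buffer))
    {
        Page    page = (Page) BufferGetPage(buffer);

        /* ... apply the changes the WAL record describes ... */

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);
    }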

I hacked together a quick & dirty prototype of using posix_fadvise in
recovery a while ago. First of all, there are the changes to the buffer
manager, which we'd need anyway if we wanted to use posix_fadvise to
speed up other things like index scans. Then there are the changes to
xlog.c to buffer a number of WAL records, so that you can prefetch the
data pages needed by WAL records ahead of the one you're actually
replaying.
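The buffer manager side is little more than a hint to the kernel.
Something like this (a sketch only; PrefetchBlock is a made-up name,
and a real version would first check whether the page is already in
shared buffers):

    #include <fcntl.h>

    /* Ask the kernel to start reading a block we'll need soon,
     * without blocking on the I/O ourselves. */
    static void
    PrefetchBlock(int fd, BlockNumber blkno)
    {
        off_t   offset = (off_t) blkno * BLCKSZ;

    #if defined(POSIX_FADV_WILLNEED)
        (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
    #endif
    }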

I added a new function, readahead, to the rmgr API. It's similar to the
redo function, but instead of actually replaying the WAL record, it
just issues the fadvise calls to the buffer manager for the pages that
are needed to replay the record. This needs to be implemented for each
resource manager that we want to do readahead for. If we had the list
of blocks touched by a WAL record in an rmgr-independent format, we
could do it in a more generic way, the way we do backup block
restoration.
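In rmgr terms it's just one more function pointer next to rm_redo.
Roughly (a sketch; rm_readahead is my invented name, and I've omitted
the other RmgrData fields):

    typedef struct RmgrData
    {
        const char *rm_name;
        void        (*rm_redo) (XLogRecPtr lsn, XLogRecord *record);
        /* ... rm_desc, rm_startup, rm_cleanup etc. omitted ... */

        /* new: issue fadvise hints for the pages rm_redo will
         * touch, without applying anything */
        void        (*rm_readahead) (XLogRecord *record);
    } RmgrData;

Each resource manager that wants readahead fills in rm_readahead, and
the replay loop calls it for records some distance ahead of the one
currently being applied.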

The multiple-process approach seems a lot more complex to me. You need
a lot of bookkeeping to keep the processes from stepping on each
other's toes and to choose the next WAL record to replay. And you have
the same problem: you need an rmgr-specific function to extract the
data block #s required to replay a WAL record, or you have to add that
list to the WAL record header in a generic format. The multi-process
approach is nice because it also lets you parallelize the CPU work of
replaying the records, but I wonder how much that really scales, given
all the locking required. Besides, I don't think replaying WAL records
is very expensive CPU-wise; you'd need a pretty impressive RAID array
to read WAL from before you'd saturate a single CPU.
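For what it's worth, the obvious way to do that bookkeeping is what
Greg suggests above: extract each record's block references, sort
them, and take the locks in order, so two workers can never deadlock.
A sketch (BlockRef, MAX_BLOCK_REFS, GetBlockRefs and blockref_cmp are
all made-up names):

    /* Lock a record's target blocks in sorted order, so workers
     * with overlapping block sets always acquire the locks in the
     * same global order and cannot deadlock. */
    BlockRef    refs[MAX_BLOCK_REFS];
    Buffer      bufs[MAX_BLOCK_REFS];
    int         nrefs = GetBlockRefs(record, refs);
    int         i;

    qsort(refs, nrefs, sizeof(BlockRef), blockref_cmp);
    for (i = 0; i < nrefs; i++)
        bufs[i] = XLogReadBuffer(refs[i].rnode, refs[i].blkno, false);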

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
