Re: [GENERAL] Slow PITR restore

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 20:25:07
Message-ID: 47619523.8010107@enterprisedb.com
Lists: pgsql-general pgsql-hackers

Simon Riggs wrote:
> Allocate a recovery cache of size maintenance_work_mem that goes away
> when recovery ends.
>
> For every block mentioned in a WAL record that isn't an overwrite,
> first check shared_buffers. If it's in shared_buffers, apply the record
> immediately and move on. If it's not, put the record in the recovery
> cache.
>
> When the cache fills, empty it: qsort the WAL records by rmgrid, rel,
> blockid, lsn. Then we scan through the records applying them in
> sequence. That way we will accumulate changes on each block so we only
> need to request it once rather than thrashing the cache. We may get
> lucky and pick up some OS readahead also. We would also use buffer
> recycling when emptying the recovery cache, to ensure that we don't
> trash the main cache and also gain from L2 cache efficiency.
>
> When recovery ends, empty the cache.
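The batching scheme quoted above could be sketched roughly as follows. This is an illustrative sketch only, not PostgreSQL source: shared_buffers is modeled as a set of cached blocks, and apply_record, cache_limit, and the record fields are hypothetical stand-ins for the real structures.

```python
# Sketch of Simon's proposal: apply a WAL record immediately if its
# target block is already cached; otherwise buffer it, and when the
# buffer fills, sort by (rmgrid, rel, blockid, lsn) and apply in that
# order so each block is fetched once with all its changes applied
# together. All names here are illustrative assumptions.

def replay(wal_records, shared_buffers, cache_limit, apply_record):
    cache = []  # pending (rmgrid, rel, blockid, lsn, record) tuples

    def flush():
        # Sorting groups every change to one block together, so the
        # block is read once and its records applied in LSN order.
        cache.sort(key=lambda r: (r[0], r[1], r[2], r[3]))
        for rmgrid, rel, blockid, lsn, rec in cache:
            apply_record(rec)
        cache.clear()

    for rec in wal_records:
        if (rec["rel"], rec["blockid"]) in shared_buffers:
            apply_record(rec)  # block already cached: apply immediately
        else:
            cache.append((rec["rmgrid"], rec["rel"], rec["blockid"],
                          rec["lsn"], rec))
            if len(cache) >= cache_limit:
                flush()
    flush()  # when recovery ends, empty the cache
```

The sort key puts lsn last, so within one block the original WAL ordering is preserved, which is what makes the reordering safe for a single block's changes.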

Hmm. That assumes that nothing other than WAL replay will read pages
into shared buffers. I guess that's true at the moment, but it doesn't
seem impossible that something like Florian's read-only queries on a
standby server would change that.

> I think that is better than both methods mentioned, and definitely
> simpler than my brute-force method. It also lends itself to using both
> previously mentioned methods as additional techniques if we really
> needed to. I suspect reordering the I/Os in this way is going to make a
> huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going
to be issuing only one read request at a time. The order of those
requests will be better from a cache-hit point of view, which is good,
but the problem remains that if the modified data blocks are scattered
around the database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of
long PITR archives, though. You could scan all the WAL records (or at
least scan far ahead), and note where there are full-page writes for
each page. Whenever there's a full-page write further ahead in the log,
you could ignore all earlier changes to that page, because they're
going to be overwritten anyway. It won't help with normal recovery,
because there won't be more than one full-page image of each page after
the last checkpoint, but with PITR it would help.
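That scan-ahead pruning could look something like this sketch. Again this is illustrative, not real code: the record fields (rel, blockid, lsn, full_page) are assumed stand-ins for the actual WAL record layout.

```python
# Sketch of the two-pass idea: pass 1 scans ahead and records the LSN
# of the last full-page image of each page; pass 2 drops any record
# for a page that a later full-page write will overwrite anyway.
# Field names are illustrative assumptions, not PostgreSQL's.

def prune_for_pitr(wal_records):
    last_fpw = {}  # (rel, blockid) -> LSN of last full-page image
    for rec in wal_records:            # pass 1: scan ahead
        if rec["full_page"]:
            last_fpw[(rec["rel"], rec["blockid"])] = rec["lsn"]

    kept = []
    for rec in wal_records:            # pass 2: drop superseded records
        fpw_lsn = last_fpw.get((rec["rel"], rec["blockid"]))
        if fpw_lsn is not None and rec["lsn"] < fpw_lsn:
            continue  # a later full-page write overwrites this change
        kept.append(rec)
    return kept
```

For a long PITR archive with periodic checkpoints (and thus repeated full-page images), most of the per-page changes between two images could be skipped entirely.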

> Looks like each rmgr_redo call would need to be split into two calls:
> rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> apply if possible, otherwise place in cache. The second gets called
> repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a
given WAL record will need to access. For multiple recovery processes,
you need that to figure out which WAL records you can safely replay. In
the posix_fadvise scheme, you need that to issue the posix_fadvises
without modifying anything.
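Given such a per-record "which blocks does this touch" function, the fadvise-style scheme might be driven like this sketch. All of the function names here (blocks_touched, prefetch, apply_record) are hypothetical stand-ins for rmgr-specific code and OS hints such as posix_fadvise(POSIX_FADV_WILLNEED), not actual PostgreSQL APIs.

```python
# Sketch: before replaying each record, look a fixed window ahead in
# the WAL and issue prefetch hints for blocks that future records will
# touch, so their reads overlap with current replay work.
# blocks_touched(rec) is the assumed per-record block list.

def replay_with_prefetch(wal_records, blocks_touched, prefetch,
                         apply_record, window=32):
    hinted = set()  # blocks we have already issued a hint for
    for i, rec in enumerate(wal_records):
        for future in wal_records[i + 1:i + 1 + window]:
            for blk in blocks_touched(future):
                if blk not in hinted:
                    prefetch(blk)  # e.g. posix_fadvise(WILLNEED)
                    hinted.add(blk)
        apply_record(rec)  # replay proceeds strictly in WAL order
```

The same blocks_touched primitive is what a multi-process scheme would need, too: records touching disjoint block sets are the ones that can safely replay concurrently.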

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
