Re: WIP: WAL prefetch (another approach)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-11-13 02:20:26
Message-ID: CA+hUKGJwuzYCLxCP7QpK54uHBxEDGNcnDF4CJuJnuvFDpn4h5A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 11, 2020 at 12:29 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I repeated the same testing I did before - I started with a 32GB pgbench
> database with archiving, ran pgbench for 1h to generate plenty of WAL,
> and then performed recovery from a snapshot + archived WAL on different
> storage types. The instance was running on an NVMe SSD, allowing it to
> generate ~200GB of WAL in 1h.

Thanks for running these tests! And sorry for the delay in replying.

> The recovery was done on two storage types - SATA RAID0 with 3 x 7.2k
> spinning drives and NVMe SSD. On each storage I tested three configs -
> disabled prefetching, defaults and increased values:
>
> wal_decode_buffer_size = 4MB (so 8x the default)
> maintenance_io_concurrency = 100 (so 10x the default)
>
> FWIW there's a bunch of issues with the GUCs - the .conf.sample file
> does not include e.g. recovery_prefetch, and instead includes
> #max_recovery_prefetch_distance, which was however replaced by
> wal_decode_buffer_size. Another thing is that the actual default values
> differ from the docs - e.g. the docs say that wal_decode_buffer_size is
> 256kB by default, when in fact it's 512kB.

Oops. Fixed, and rebased.
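
For reference, this is roughly how the corrected sample-config entries
should read (a sketch only - the recovery_prefetch default shown is an
assumption on my part, while the other two defaults are the ones
discussed above):

    # - Recovery prefetching -
    #recovery_prefetch = on                 # prefetch referenced blocks (assumed default)
    #wal_decode_buffer_size = 512kB         # how far ahead to decode the WAL
    #maintenance_io_concurrency = 10        # prefetch queue depth during recovery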

> Now, some results ...
>
> 1) NVMe
>
> For the fast storage, there's a modest improvement. The times it took to
> recover the ~13k WAL segments are these:
>
> no prefetch: 5532s
> default: 4613s
> increased: 4549s
>
> So the speedup from enabling prefetch is ~20%, but increasing the values
> to make it more aggressive has little effect. Fair enough, the NVMe
> is probably fast enough not to benefit from longer I/O queues here.
>
> This is a bit misleading though, because the effectiveness of prefetching
> very much depends on the fraction of FPIs in the WAL stream - and right
> after a checkpoint that's most of the WAL, which makes the prefetching
> less efficient. We still have to parse the WAL etc. without actually
> prefetching anything, so it's pure overhead.

Yeah. I've tried to reduce that overhead as much as possible, by
decoding each record once and looking up each buffer only once. The
extra overhead of making posix_fadvise() calls is unfortunate
(especially if they aren't helping because shared buffers is small but
the page cache is huge), but that should be fixed by switching to
proper AIO, independently of this patch; that will batch those calls
and remove the separate pread().
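
To make that overhead concrete, the prefetching ultimately boils down
to one advisory call per referenced block, along these lines (a
simplified sketch, not the patch's actual code - the smgr and
buffer-lookup layers are omitted and the helper name is invented):

    #include <fcntl.h>

    #define BLCKSZ 8192        /* PostgreSQL's standard block size */

    /*
     * Ask the kernel to start reading a relation block in the
     * background.  If the block is already in shared buffers this is
     * wasted work, and if shared buffers are small but the OS page
     * cache already holds the data, we pay the syscall for nothing --
     * the overhead described above.
     */
    static void
    prefetch_block(int fd, unsigned int blockno)
    {
        (void) posix_fadvise(fd, (off_t) blockno * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);
    }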

> So I've also generated a chart showing the time (in milliseconds) needed
> to apply individual WAL segments. It clearly shows that there are 3
> checkpoints, and that for each checkpoint the redo is initially very
> cheap (thanks to FPIs) and gets more expensive as the fraction of FPIs
> drops - at which point the prefetch actually helps, by up to 30% in some
> cases (so a bit more than the overall speedup). All of this is expected,
> of course.

That is a nice way to see the effect of FPI on recovery.

> 2) 3 x 7.2k SATA RAID0
>
> For the spinning rust, I had to make some compromises. It's not feasible
> to apply all the 200GB of WAL - it would take way too long. I only
> applied ~2600 segments for each configuration (so not even one whole
> checkpoint), and even that took ~20h in each case.
>
> The durations look like this:
>
> no prefetch: 72446s
> default: 73653s
> increased: 55409s
>
> So in this case the default settings are way too low - they actually make
> the recovery a bit slower, while with increased values there's a ~25%
> speedup, which is nice. I assume that if a larger number of WAL segments
> were applied (e.g. the whole checkpoint), the prefetch numbers would be a
> bit better - the initial FPI part would play a smaller role.

Huh. Interesting.

> From the attached "average per segment" chart you can see that the basic
> behavior is about the same as for NVMe - initially it's slower due to
> FPIs in the WAL stream, and then it gets ~30% faster.

Yeah. I expect that one day not too far away we'll figure out how to
get rid of FPIs (through a good enough double-write log or
O_ATOMIC)...

> Overall I think it looks good. I haven't looked at the code very much,
> and I can't comment on the potential optimizations mentioned a couple
> days ago yet.

Thanks!

I'm not really sure what to do about archive restore scripts that
block. That seems to be fundamentally incompatible with what I'm
doing here.

Attachment Content-Type Size
v13-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch text/x-patch 3.4 KB
v13-0002-Improve-information-about-received-WAL.patch text/x-patch 7.8 KB
v13-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch text/x-patch 60.0 KB
v13-0004-Prefetch-referenced-blocks-during-recovery.patch text/x-patch 64.1 KB
v13-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patch text/x-patch 10.7 KB
