Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-01-02 18:10:55
Message-ID: 20200102181055.v4jdxvof3wioyryl@development
Lists: pgsql-hackers

On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
>Hello hackers,
>
>Based on ideas from earlier discussions[1][2], here is an experimental
>WIP patch to improve recovery speed by prefetching blocks. If you set
>wal_prefetch_distance to a positive distance, measured in bytes, then
>the recovery loop will look ahead in the WAL and call PrefetchBuffer()
>for referenced blocks. This can speed things up with cold caches
>(example: after a server reboot) and working sets that don't fit in
>memory (example: large scale pgbench).
>

Thanks. I've only done a very quick review so far, but the patch looks fine.

>Results vary, but in contrived larger-than-memory pgbench crash
>recovery experiments on a Linux development system, I've seen recovery
>running as much as 20x faster with full_page_writes=off and
>wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
>discussed in the other thread.
>

OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.

> ...
>
>Earlier work, and how this patch compares:
>
>* Sean Chittenden wrote pg_prefaulter[1], an external process that
>uses worker threads to pread() referenced pages some time before
>recovery does, and demonstrated very good speed-up, triggering a lot
>of discussion of this topic. My WIP patch differs mainly in that it's
>integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
>than synchronous I/O from worker threads/processes. Sean wouldn't
>have liked my patch much because he was working on ZFS and that
>doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
>works pretty well, and I'll try to get that upstreamed.
>

How long would it take to get POSIX_FADV_WILLNEED support into ZFS systems,
assuming everything goes fine? I'm not sure what the usual release
life-cycle is, but I assume it may take a couple of years to reach most
production systems.

What other common filesystems are missing support for this?

Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.
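
Just to illustrate what I mean, the core of such a pread()-based worker
would be pretty trivial -- roughly the sketch below (hypothetical code, not
what pg_prefaulter actually does; the path and block number would come from
decoding the WAL):

/*
 * Hypothetical sketch (not pg_prefaulter's actual code): warm the kernel
 * page cache for one 8kB block by reading it into a scratch buffer and
 * throwing the data away, which is what a pread()-based prefetch worker
 * boils down to on filesystems without POSIX_FADV_WILLNEED.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
prefetch_block_pread(const char *path, uint32_t blockno)
{
    char        buf[BLCKSZ];
    int         fd;
    ssize_t     nread;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The read itself is the prefetch; the data is simply discarded. */
    nread = pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);
    close(fd);

    return (nread == BLCKSZ) ? 0 : -1;
}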

> ...
>
>Here are some cases where I expect this patch to perform badly:
>
>* Your WAL has multiple intermixed sequential access streams (ie
>sequential access to N different relations), so that sequential access
>is not detected, and then all the WILLNEED advice prevents Linux's
>automagic readahead from working well. Perhaps that could be
>mitigated by having a system that can detect up to N concurrent
>streams, where N is more than the current 1, or by flagging buffers in
>the WAL as part of a sequential stream. I haven't looked into this.
>

Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not one
by one), and to do some sort of sorting? That should allow readahead to
kick in.
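
Something like the sketch below is what I have in mind: collect a batch of
upcoming block references, sort them, drop duplicates, and only then issue
the prefetch calls, so per-file accesses become mostly sequential and
readahead still has a chance (the BlockRef type and the issue_prefetch()
callback are made up for illustration):

#include <stdint.h>
#include <stdlib.h>

typedef struct BlockRef
{
    uint32_t    rel;            /* some stable relation identifier */
    uint32_t    block;          /* block number within the relation */
} BlockRef;

static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *ba = a;
    const BlockRef *bb = b;

    if (ba->rel != bb->rel)
        return (ba->rel < bb->rel) ? -1 : 1;
    if (ba->block != bb->block)
        return (ba->block < bb->block) ? -1 : 1;
    return 0;
}

/* Sort a batch of block references and prefetch each distinct block once. */
static void
prefetch_batch(BlockRef *refs, size_t n,
               void (*issue_prefetch) (uint32_t rel, uint32_t block))
{
    qsort(refs, n, sizeof(BlockRef), blockref_cmp);

    for (size_t i = 0; i < n; i++)
    {
        /* Skip duplicates within the batch. */
        if (i > 0 && blockref_cmp(&refs[i - 1], &refs[i]) == 0)
            continue;
        issue_prefetch(refs[i].rel, refs[i].block);
    }
}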

>* The data is always found in our buffer pool, so PrefetchBuffer() is
>doing nothing useful and you might as well not be calling it or doing
>the extra work that leads up to that. Perhaps that could be mitigated
>with an adaptive approach: too many PrefetchBuffer() hits and we stop
>trying to prefetch, too many XLogReadBufferForRedo() misses and we
>start trying to prefetch. That might work nicely for systems that
>start out with cold caches but eventually warm up. I haven't looked
>into this.
>

I think the question is what the cost of such an unnecessary prefetch is.
Presumably it's fairly cheap, especially compared to the opposite case (not
prefetching a block that's not in shared buffers). I also wonder how
expensive the adaptive logic would be in cases that never need a prefetch
(i.e. data sets smaller than shared_buffers).
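
To illustrate why I'd expect the adaptive bookkeeping to be cheap -- it
could be little more than a few counters and a periodic check, roughly like
this (names and thresholds are invented, of course):

#include <stdbool.h>
#include <stdint.h>

typedef struct PrefetchAdaptiveState
{
    bool        enabled;
    uint64_t    prefetch_hits;      /* prefetched block was already in shared buffers */
    uint64_t    prefetch_misses;    /* prefetch actually initiated an I/O */
    uint64_t    redo_stalls;        /* redo had to wait for a synchronous read */
} PrefetchAdaptiveState;

#define ADAPT_WINDOW    1024        /* re-evaluate after this many events */

static void
maybe_adapt(PrefetchAdaptiveState *st)
{
    uint64_t    total = st->prefetch_hits + st->prefetch_misses + st->redo_stalls;

    if (total < ADAPT_WINDOW)
        return;

    /* Mostly useless prefetches: stop issuing them. */
    if (st->enabled && st->prefetch_hits > 9 * st->prefetch_misses)
        st->enabled = false;

    /* Redo keeps stalling on reads: start prefetching again. */
    if (!st->enabled && st->redo_stalls > total / 10)
        st->enabled = true;

    /* Start a fresh window. */
    st->prefetch_hits = st->prefetch_misses = st->redo_stalls = 0;
}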

>* The data is actually always in the kernel's cache, so the advice is
>a waste of a syscall. That might imply that you should probably be
>running with a larger shared_buffers (?). It's technically possible
>to ask the operating system if a region is cached on many systems,
>which could in theory be used for some kind of adaptive heuristic that
>would disable pointless prefetching, but I'm not proposing that.
>Ultimately this problem would be avoided by moving to true async I/O,
>where we'd be initiating the read all the way into our buffers (ie it
>replaces the later pread() so it's a wash, at worst).
>

Makes sense.
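
FWIW the "ask the operating system if a region is cached" part would
presumably boil down to mincore(2) on Linux, something like the sketch
below -- just to illustrate the heuristic, I'm not suggesting we actually
do this:

#include <stdbool.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustration only: check whether a file region (offset must be
 * page-aligned, which holds for 8kB blocks on common page sizes) is
 * fully resident in the kernel page cache.
 */
static bool
file_region_is_cached(int fd, off_t offset, size_t len)
{
    long        pagesize = sysconf(_SC_PAGESIZE);
    size_t      npages = (len + pagesize - 1) / pagesize;
    unsigned char *vec;
    void       *addr;
    bool        cached = false;

    /* Map the region without touching it, then ask which pages are resident. */
    addr = mmap(NULL, len, PROT_NONE, MAP_SHARED, fd, offset);
    if (addr == MAP_FAILED)
        return false;

    vec = malloc(npages);
    if (vec != NULL && mincore(addr, len, vec) == 0)
    {
        cached = true;
        for (size_t i = 0; i < npages; i++)
            if ((vec[i] & 1) == 0)
                cached = false;
    }

    free(vec);
    munmap(addr, len);
    return cached;
}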

>* The prefetch distance is set too low so that pread() waits are not
>avoided, or your storage subsystem can't actually perform enough
>concurrent I/O to get ahead of the random access pattern you're
>generating, so no distance would be far enough ahead. To help with
>the former case, perhaps we could invent something smarter than a
>user-supplied distance (something like "N cold block references
>ahead", possibly using effective_io_concurrency, rather than "N bytes
>ahead").
>

In general, I find it quite non-intuitive to configure prefetching by
specifying a WAL distance. I mean, how would you know what's a good value?
If you know the storage hardware, you probably know the optimal queue
depth, i.e. the number of in-flight requests needed to get the best
throughput.

But how do you deduce the WAL distance from that? I don't know. Plus,
right after a checkpoint the WAL contains FPWs, reducing the number of
block references in a given amount of WAL (compared to right before the
checkpoint). So I expect users might pick an unnecessarily high WAL
distance. OTOH with FPWs we don't really need aggressive prefetching,
right?

Could we instead specify the number of blocks to prefetch? We'd probably
need to track some additional details to determine how many blocks are
currently being prefetched (essentially the LSN for each prefetch request).
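
Roughly what I have in mind is a small queue of in-flight prefetches,
retired once replay passes their LSN, with new prefetches issued only while
the queue is below the configured depth. A hand-wavy sketch (all names
invented, not taken from the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL position, as in PostgreSQL */

#define MAX_INFLIGHT 32             /* could be derived from effective_io_concurrency */

typedef struct PrefetchQueue
{
    XLogRecPtr  lsns[MAX_INFLIGHT]; /* LSN of the record that triggered each prefetch */
    int         head;               /* oldest in-flight entry */
    int         count;              /* number of in-flight prefetches */
} PrefetchQueue;

/* Drop entries whose WAL record has already been replayed. */
static void
prefetch_queue_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
{
    while (q->count > 0 && q->lsns[q->head] <= replayed_upto)
    {
        q->head = (q->head + 1) % MAX_INFLIGHT;
        q->count--;
    }
}

/* Record a new prefetch, unless we're already at the configured depth. */
static bool
prefetch_queue_try_add(PrefetchQueue *q, XLogRecPtr record_lsn)
{
    if (q->count >= MAX_INFLIGHT)
        return false;               /* caller should stop looking ahead for now */

    q->lsns[(q->head + q->count) % MAX_INFLIGHT] = record_lsn;
    q->count++;
    return true;
}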

Another thing to consider might be skipping recently prefetched blocks.
Consider a loop doing DML, where each statement creates a separate WAL
record but can easily touch the same block over and over (say, inserting
into the same page). In that case the prefetches are not really needed,
although I'm not sure how expensive they actually are.
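
A tiny cache of recently prefetched block tags might be enough to deal with
that, something like this (again just an illustrative sketch, the names are
made up):

#include <stdbool.h>
#include <stdint.h>

#define RECENT_SLOTS 64             /* power of two, keeps the indexing cheap */

typedef struct RecentPrefetch
{
    uint64_t    tags[RECENT_SLOTS]; /* packed (relation, block) tags */
} RecentPrefetch;

/*
 * Returns true if the block was seen recently (so the prefetch can be
 * skipped), and remembers it otherwise, evicting whatever shared its slot.
 */
static bool
recently_prefetched(RecentPrefetch *rp, uint32_t rel, uint32_t block)
{
    uint64_t    tag = ((uint64_t) rel << 32) | block;
    uint64_t    hash = tag * UINT64_C(0x9E3779B97F4A7C15);  /* cheap multiplicative hash */
    int         slot = (int) (hash >> 58);                  /* top 6 bits -> 64 slots */

    if (rp->tags[slot] == tag)
        return true;

    rp->tags[slot] = tag;
    return false;
}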

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
