Re: WAL prefetch

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Sean Chittenden <seanc(at)joyent(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, sean(at)chittenden(dot)org
Subject: Re: WAL prefetch
Date: 2018-07-09 09:59:06
Message-ID: e7116515-1fbe-06ca-a40d-530f8e899126@2ndquadrant.com
Lists: pgsql-hackers

On 07/09/2018 02:26 AM, Sean Chittenden wrote:
>
> ... snip ...
>
> The real importance of prefaulting becomes apparent in the following two
> situations:
>
> 1. Priming the OS's filesystem cache, notably after an OS restart. This is of
> value to all PostgreSQL scenarios, regardless of whether the server is a
> primary or a follower. Reducing database startup/recovery times is very
> helpful, especially when recovering from an outage or after having performed
> planned maintenance. Little in PostgreSQL administration is more infuriating
> than watching PostgreSQL recover and seeing the CPU 100% idle and the disk IO
> system nearly completely idle (especially during an outage or when recovering
> from an outage).
> 2. When the following two environmental factors are true:
> a. the volume of writes to discrete pages is high
> b. the interval between subsequent writes to a single page is long enough
> that a page is evicted from both shared_buffers and the filesystem cache
>
> Write-heavy workloads tend to see this problem, especially if you're
> attempting to provide consistency in your application and do not read from
> the followers (which would otherwise prime their OS/shared_buffers cache). If
> the workload is continuous, the follower may never be able to overcome the
> write volume and the database never catches up.
>
> The pg_prefaulter was borne out of the last workload, namely a write-heavy, 24/7
> constant load with a large dataset.
>

Good, that generally matches the workload I've been using for testing.

>
> What pg_prefaulter does is read in the blocks referenced from the WAL stream
> (i.e. PG heap pages) and then load the referenced pages into the OS filesystem
> cache (via threaded calls to pread(2)). The WAL apply process then gets a cache
> hit because the filesystem cache has been primed with the heap page before the
> apply process attempts to perform its read-modify-write of the heap.
>
> It is important to highlight that this is a problem because there is only one
> synchronous pread(2) call in flight at a time from the apply/recover/startup
> process, which effectively acts as the speed limit for PostgreSQL. The physics
> of many workloads are such that followers are unable to keep up and are thus
> destined to always fall behind (we've all seen this at some point, likely via
> apply lag from a VACUUM or pg_repack). The primary can schedule concurrent IO
> from multiple clients all making independent SELECTs. Contrast that to a replica,
> which has zero knowledge of the IOs that the primary recently dispatched, where all
> IO looks like random reads and likely cache misses. In effect, the pg_prefaulter
> raises the speed limit of the WAL apply/recovery process by priming the
> filesystem cache by snooping in on the WAL stream.
>
> PostgreSQL's WAL apply and recovery process is only capable of scheduling a
> single synchronous pread(2) syscall. As a result, even if you have a RAID10
> and a capable IO scheduler in the OS that is able to read from both halves of
> each mirror, you're only going to perform ~150-225 pread(2) calls per second.
> Despite the entire disk system being able to deliver something closer to
> 2400-3600 IOPS (~10ms random-read == ~150-225 IOPS * 16x disks), you'll only
> observe ~6% utilization of the random read IO capabilities of a server. When
> you realize the size of the unapplied WAL entries represents a backlog of queued
> or unscheduled IOs, the phrase "suboptimal" doesn't begin to do reality justice.
>
> One or more of the following activities can demonstrate the problem:
>
> * Natural random-write workloads at high velocity
> * VACUUM activity
> * pg_repack
> * slow IO subsystems on followers
> * synchronous apply
>
> Regarding the environment where pg_prefaulter was written, the servers were
> reasonably beefy (256GB RAM, 16x 10K SAS disks) and this database
> cluster was already in a scale-out configuration. Doubling the number of
> database servers would only spread the load out by 2x, but we'd still only be
> utilizing ~6% of the IO across the fleet. We needed ~100% IO utilization when
> followers were falling behind. In practice we are seeing orders of magnitude
> improvement in apply lag.
>

Yeah, the poor I/O utilization is annoying. Considering the storage is
often the most expensive part of the database system, it's a bit like
throwing money out of the window :-/
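
Just to make the mechanism concrete, here's a minimal sketch (assumed, not
pg_prefaulter's actual code) of what a single prefault worker boils down to:
a blocking pread(2) into a throwaway buffer, issued from a pool of worker
threads so that many reads are in flight at once:

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL heap page size */

/*
 * Read one heap block purely to pull it into the OS filesystem cache.
 * The buffer contents are thrown away; the read itself is the point.
 * In practice this would be called from a pool of worker threads.
 */
static void
prefault_block(const char *path, off_t blockno)
{
    char        buf[BLCKSZ];
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;                 /* relation may have been dropped; ignore */

    if (pread(fd, buf, BLCKSZ, blockno * BLCKSZ) < 0)
        perror("pread");

    close(fd);
}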

>
> Other points worth mentioning:
>
> * the checkpoint_interval was set to anywhere between 15s and 300s; it didn't
> matter. We did discover a new form of lag, however: checkpoint lag. Pages
> were being evicted from cache faster than checkpoints were able to execute,
> leading to unbounded checkpoints and WAL growth (i.e. writes were fast enough
> that the checkpointer was suffering from Cold RMW). iirc, pg_prefaulter reads
> in both WAL pages and WAL files that are about to be used in checkpoints (it's
> been a while since I wrote this code).
>

Hmmm, I'm not sure how a checkpointer could hit a cold RMW, considering
it merely writes out dirty pages from shared buffers. Although, perhaps
it's specific to ZFS setups with 16kB record sizes?

> * The pg_prefaulter saw the best performance when we increased the number of IO
> workers to be roughly equal to the available IO commands the OS could schedule
> and dispatch (i.e. 16x disks * 150 IOPS == ~2K).
>

Yeah. I wonder how this would work for flash-based storage that can
achieve much higher IOPS values.

> * pg_prefaulter is very aggressive about not performing work twice or reading
> the same page multiple times. pg_prefaulter uses a heap page cache to prevent
> redundant IOs for the same PG heap page. pg_prefaulter also dedupes IO
> requests in case the same page was referenced twice in short succession due to
> data locality in the WAL stream. The workload was already under cache
> pressure. Artificially promoting a page from the ARC MRU to MFU would result
> in potentially useful records in the MFU being evicted from cache.
>

Makes sense. I think the patch does that too, by keeping a cache of
recently prefetched blocks.
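
For the archives, a hypothetical sketch of such a "recently prefetched"
filter, assuming requests are keyed by (relfilenode, block number); a hit
means the block was requested recently and the I/O can be skipped, and a
collision simply overwrites the older entry:

#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_CACHE_SLOTS 8192

typedef struct
{
    uint32_t    relfilenode;
    uint32_t    blockno;
} PrefetchedBlock;

/* zero-initialised slots never match, assuming relfilenode 0 is not used */
static PrefetchedBlock prefetch_cache[PREFETCH_CACHE_SLOTS];

/* Returns true if the block was prefetched recently (so skip the I/O). */
static bool
recently_prefetched(uint32_t relfilenode, uint32_t blockno)
{
    uint32_t    hash = (relfilenode * 2654435761u) ^ (blockno * 40503u);
    PrefetchedBlock *slot = &prefetch_cache[hash % PREFETCH_CACHE_SLOTS];

    if (slot->relfilenode == relfilenode && slot->blockno == blockno)
        return true;

    /* Remember this block; a collision just evicts the previous entry. */
    slot->relfilenode = relfilenode;
    slot->blockno = blockno;
    return false;
}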

> * During the design phase, I looked into using bgworkers but given the number of
> in-flight pread(2) calls required to fully utilize the IO subsystem, I opted
> for something threaded (I was also confined to using Solaris which doesn't
> support posix_fadvise(2), so I couldn't sequentially dispatch async
> posix_fadvise(2) calls and hope for the best).
>

Hmm, yeah. I'm not sure what to do with this. Using many (thousands?) of
prefetch processes seems like a bad idea - we surely can't make them
regular bgworkers. Perhaps we could use one process with many threads?

Presumably if we knew about a better way to do prefetching without
posix_fadvise, we'd have implemented it in FilePrefetch(). But we just
error out instead :-(
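
For reference, the posix_fadvise() path is essentially a fire-and-forget
hint. A rough sketch of the pattern (not the actual FilePrefetch() code):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

static int
prefetch_hint(int fd, off_t offset, off_t amount)
{
#if defined(POSIX_FADV_WILLNEED)
    /* Non-blocking hint; the kernel may start reading the range now. */
    return posix_fadvise(fd, offset, amount, POSIX_FADV_WILLNEED);
#else
    /*
     * No asynchronous hint available (e.g. Solaris).  A caller that still
     * wants prefetching has to fall back to synchronous reads issued from
     * worker threads or processes, which is what pg_prefaulter does.
     */
    (void) fd;
    (void) offset;
    (void) amount;
    return 0;
#endif
}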

> * In my testing I was successfully using pgbench(1) to simulate the workload.
> Increasing the checkpoint_interval and segments to a very high number was
> sufficient. I could see the improvement for cold-start even with SSDs, but
> I'm not sure how big of an impact this would be for NVMe.
>

I think the impact on NVMe (particularly Optane) will be smaller,
because the devices handle low queue depths better, particularly for
reads. AFAIK it's the opposite for writes (higher queue depths are
needed), but writes are kinda throttled by reads (faster recovery means
more write requests). But then again, if you have multiple NVMe devices
in a RAID, a non-trivial number of requests is still needed.

> * My slides are posted and have graphs of the before and after using the
> pg_prefaulter, but I'm happy to provide additional details or answer more Q's.
>
> * It would be interesting to see if posix_fadvise(2) is actually harming
> performance. For example, spinning off a throw-away implementation that uses
> aio or a pthread worker pool + pread(2). I do remember seeing some mailing
> list blurbs from Mozilla where they were attempting to use posix_fadvise(2)
> and were seeing a slow down in performance on Linux (I believe this has since
> been fixed, but it wouldn't surprise me if there were still unintended
> consequences from this syscall).
>

Not sure, but in this case we can demonstrate it clearly helps. Maybe
there's an alternative way to do async prefetching, performing better
(say, aio or whatever), but I've seen plenty of issues with those too.

> * I have a port of pg_prefaulter that supports PostgreSQL >= 9 ~complete, but
> not pushed. I'll see if I can get to that this week. For "reasons" this
> isn't a high priority for me at the moment, but I'm happy to help out and see
> this move forward.
>

Good to hear that.

> * Tomas, feel free to contact me offline to discuss why the pg_prefault isn't
> working for you. I have it running on Linux, FreeBSD, illumos, and macOS.
>

Will do. It can easily be due to my lack of golang knowledge, or
something similarly silly.

> * In the graph from Tomas (nice work, btw), it's clear that the bandwidth is the
> same. The way that we verified this earlier was to run ~10-15min traces and
> capture the file and offset of every read of PostgreSQL and pg_prefaulter. We
> saw that pg_prefaulter's IOs were ~100% cache misses. For PostgreSQL, we could
> observe that ~99% of its IO was cache hits. We also verified that pg_prefaulter wasn't
> doing any IO that wasn't eventually performed by PostgreSQL by comparing the
> IOs performed against each heap segment.
>

I'm not sure which bandwidth you mean here?

>
> * "In this case I see that without prefetching, the replay process uses about
> 20% of a CPU. With prefetching this increases to ~60%, which is nice." With
> the pg_prefaulter, the IO should hit 100% utilization. Like I mentioned
> above, Tomas, I'd like to make sure you get this working so you can compare
> and improve as necessary. :~] I never got CPU utilization to 100%, but I did
> get disk IO utilization to 100%, and that to me was the definition of success.
> The apply process could become 100% CPU utilized with fast enough disks, but
> in production I didn't have anything that wasn't spinning
> rust.
>

Not sure 100% is really achievable, but we can try. There's room for
improvement, that's for sure.

> * It looks like we're still trying to figure out the nature of this problem and
> the cost of various approaches. From a rapid prototyping perspective, feel
> free to suggest changes to the Go pg_prefaulter and toss the various
> experiments behind a feature flag.
>
> * "> But it is implemented in GO and using pg_waldump.
> Yeah, that's not too good if we want it in core."
> I fail to see the problem with a side-car in Go. *checks calendar* :~]
>

I think there are a couple of valid reasons for that. It's not that we're
somehow against Go in principle, but adding languages to a code base
makes it more difficult to maintain. Also, if we want to integrate it
with core (start it automatically on replicas, make it access internal
state, etc.) it's just easier to do that from C.

It can be done from a standalone tool (say, an extension written in Go).
But then why make it part of core at all? That has disadvantages too,
like coupling the release cycle etc.

> * pg_prefaulter will not activate if the number of unapplied WAL pages is less
> than the size of 1 WAL segment (i.e. 16MB). This could be tuned further, but
> this low-water mark seems to work well.
>
> * pg_prefaulter won't read ahead more than 100x WAL segments into the future. I
> made the unverified assumption that PostgreSQL could not process more than
> 1.6GB of WAL (~1/2 of the memory bandwidth available for a single process) in
> less than the rotational latency of a random IO (i.e. ~10ms), and that
> pg_prefaulter could in theory stay ahead of PG. PostgreSQL normally overtakes
> pg_prefaulter's ability to fault in random pages due to disk IO limitations
> (PostgreSQL's cache hit rate is ~99.9%, not 100% for this very reason). In
> practice this has worked out, but I'm sure there's room for improvement with
> regards to setting the high-watermark and reducing this value. #yoloconstants
>

I think there's a stable state where the recovery reaches maximum
performance and we don't prefetch pages too far ahead (at some point the
recovery speed will stop improving, and eventually start decreasing
because we'll end up pushing out pages we've prefetched). I wonder how
we could auto-tune this.
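
For reference, the low/high watermark behaviour Sean describes boils down
to something like the following sketch (the names and constants are made
up for illustration, using the 16MB segment / 100-segment figures above):

#include <stdint.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * How far ahead of the apply position (in bytes of WAL) prefetching should
 * look.  Below one segment of lag it is not worth the effort; above
 * max_segments of lag we cap the window so we don't prefetch pages that
 * would be evicted again before recovery reaches them.
 */
static uint64_t
prefetch_window(uint64_t apply_lsn, uint64_t received_lsn, int max_segments)
{
    uint64_t    lag = received_lsn - apply_lsn;
    uint64_t    high = (uint64_t) max_segments * WAL_SEGMENT_SIZE;

    if (lag < WAL_SEGMENT_SIZE)
        return 0;               /* low watermark: don't activate */

    return (lag < high) ? lag : high;   /* high watermark: cap the window */
}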

> * I contemplated not reading in FPW but this would have been detrimental on ZFS
> because ZFS is a copy-on-write filesystem (vs block filesystem). For ZFS, we
> are using a 16K record size, compressed down to ~8K. We have to read the
> entire record in before we can modify half of the page. I suspect eliding
> prefaulting FPWs will always be a performance loss for nearly all hardware.
>

That's a good point - on regular filesystems with small pages we can
just skip FPW (in fact, we should treat them as prefetched), while on
ZFS we need to prefetch them. We probably need to make this configurable.
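
A hypothetical sketch of what that configurable behaviour might look like
("prefetch_fpw" is a made-up name standing in for a would-be setting):

#include <stdbool.h>

static bool prefetch_fpw = false;   /* would become a config knob */

/*
 * Decide whether a block referenced by a WAL record needs prefetching.
 * Restoring a full-page image does not need the old page contents, so on
 * regular filesystems the block can be treated as already prefetched.  On
 * copy-on-write filesystems with large record sizes (e.g. ZFS with 16kB
 * records) the old record still has to be read before it can be rewritten,
 * so prefetching still pays off.
 */
static bool
should_prefetch_block(bool has_full_page_image)
{
    if (has_full_page_image)
        return prefetch_fpw;

    return true;            /* plain record: apply does a read-modify-write */
}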

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
