Re: WIP: WAL prefetch (another approach)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-09-23 23:38:45
Message-ID: CA+hUKG+2Vw3UAVNJSfz5_zhRcHUWEBDrpB7pyQ85Yroep0AKbw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> OK, thanks for looking into this. I guess I'll wait for an updated patch
> before testing this further. The storage has limited capacity so I'd
> have to either reduce the amount of data/WAL or juggle with the WAL
> segments somehow. Doesn't seem worth it.

Here's a new WIP version that works for archive-based recovery in my tests.

The main change I have been working on is that there is now just a
single XLogReaderState, so no more double-reading and double-decoding
of the WAL. It provides XLogReadRecord(), as before, but now you can
also read further ahead with XLogReadAhead(). The user interface is
much like before, except that the GUCs changed a bit. They are now:

recovery_prefetch=on
recovery_prefetch_fpw=off
wal_decode_buffer_size=256kB
maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and
wal_decode_buffer_size much higher than those defaults.
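
For example, on a machine with reasonably fast storage I'd try
something in this ballpark as a starting point (the numbers are only
illustrative, not the result of any tuning in this thread):

recovery_prefetch = on
recovery_prefetch_fpw = off
wal_decode_buffer_size = 2MB
maintenance_io_concurrency = 100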

There are a few TODOs and questions remaining. One issue I'm
wondering about is whether it is OK that bulky FPI data is now
memcpy'd into the decode buffer, whereas before we avoided that
sometimes, when it didn't happen to cross a page boundary; I have some
ideas on how to do better (basically two levels of ring buffer) but I
haven't looked into that yet. Another issue is the new 'nowait' API
for the page-read callback; I'm trying to figure out whether that is
sufficient, or whether something more sophisticated, perhaps including
a different return value, is required. Another thing I'm wondering
about is whether I have handled timeline changes adequately.
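
To make the 'nowait' question a bit more concrete, the behaviour I
have in mind for the callback is roughly this shape (simplified here,
with placeholder helpers, rather than the exact code in the attached
patches):

static int
example_read_page(XLogReaderState *xlogreader,
                  XLogRecPtr targetPagePtr,
                  int reqLen,
                  XLogRecPtr targetRecPtr,
                  char *readBuf,
                  bool nowait)
{
    if (!wal_available(targetPagePtr, reqLen))  /* placeholder check */
    {
        if (nowait)
            return -2;      /* "would block": stop read-ahead for now */
        wait_for_wal(targetPagePtr, reqLen);    /* placeholder wait */
    }
    /* ... copy the requested bytes into readBuf as usual ... */
    return reqLen;
}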

This design opens up a lot of possibilities for future performance
improvements. Some examples:

1. By adding some workspace to decoded records, the prefetcher can
leave breadcrumbs for XLogReadBufferForRedoExtended(), so that it
usually avoids the need for a second buffer mapping table lookup.
Incidentally this also skips the hot smgropen() calls that Jakub
complained about. I have added an experimental patch like that,
but I need to look into the interlocking some more.

2. By inspecting future records in the record->next chain, a redo
function could merge work in various ways in quite a simple and
localised way. A couple of examples:
2.1. If there is a sequence of records of the same type touching the
same page, you could process all of them while you have the page lock.
2.2. If there is a sequence of relation extensions (say, a sequence
of multi-tuple inserts to the end of a relation, as commonly seen in
bulk data loads), then instead of generating many pwrite(8kB of
zeroes) syscalls record by record to extend the relation, a single
posix_fallocate(1MB) could extend the file in one shot (see the rough
sketch after this list). Assuming the bgwriter is running and doing a
good job, this would remove most of the system calls from bulk-load
recovery.

3. More sophisticated analysis could find records to merge that are a
bit further apart, under carefully controlled conditions; for example
if you have a sequence like heap-insert, btree-insert, heap-insert,
btree-insert, ... then a simple next-record scheme like the one in 2
won't see the opportunities, but something a teensy bit smarter could.

4. Since the decoding buffer can be placed in shared memory (decoded
records contain pointers, but they don't point to any other memory
region, with the exception of clearly marked oversized records), we
could begin to contemplate handing work off to other processes, given
a clever dependency analysis scheme and some more infrastructure.
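
To illustrate 2.2 above, the relation-extension case could end up
looking roughly like this at the fd level (just a sketch; the fd and
size bookkeeping is hand-waved, not taken from the attached patches):

#include <sys/types.h>
#include <fcntl.h>

#define EXTEND_CHUNK    (1024 * 1024)   /* grow in 1MB steps, say */

/*
 * Instead of one pwrite(8kB of zeroes) per replayed record, grow the
 * file once to cover a whole run of upcoming extension records.
 * Returns 0 on success or an errno value, like posix_fallocate().
 */
static int
extend_relation_bulk(int fd, off_t current_size, off_t needed_size)
{
    off_t       new_size = current_size;

    while (new_size < needed_size)
        new_size += EXTEND_CHUNK;

    return posix_fallocate(fd, current_size, new_size - current_size);
}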

Attachment Content-Type Size
v11-0001-Add-pg_atomic_unlocked_add_fetch_XXX.patch text/x-patch 3.4 KB
v11-0002-Improve-information-about-received-WAL.patch text/x-patch 7.8 KB
v11-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch text/x-patch 59.7 KB
v11-0004-Prefetch-referenced-blocks-during-recovery.patch text/x-patch 63.9 KB
v11-0005-WIP-Avoid-extra-buffer-lookup-when-prefetching-W.patch text/x-patch 10.7 KB
