Re: WIP: WAL prefetch (another approach)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-03-18 01:47:54
Message-ID: 20200318014753.GA16720@alvherre.pgsql
Lists: pgsql-hackers

On 2020-Mar-17, Thomas Munro wrote:

Hi Thomas

> On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:

> > I didn't manage to go over 0005 though, but I agree with Tomas that
> > having this be configurable in terms of bytes of WAL is not very
> > user-friendly.
>
> The primary control is now maintenance_io_concurrency, which is
> basically what Tomas suggested.

> The byte-based control is just a cap to prevent it from reading a crazy
> distance ahead, which also functions as the on/off switch for the
> feature. In this version I've added "max" to the name, to make that
> clearer.

Mumble. I guess I should wait to comment on this after reading 0005
more in depth.

> > First of all, let me join the crowd chanting that this is badly needed;
> > I don't need to repeat what Chittenden's talk showed. "WAL recovery is
> now 10x-20x faster" would be a good item for the pg13 press release,
> > I think.
>
> We should be careful about over-promising here: Sean basically had a
> best-case scenario for this type of technology, partly due to his 16kB
> filesystem blocks. Common results may be a lot more pedestrian,
> though it could get more interesting if we figure out how to get rid
> of FPWs...

Well, in my mind it's an established fact that our WAL replay uses far
too little of the available I/O bandwidth. I guess if the system is
generating little WAL, then this change will show no benefit, but that's
not the kind of system that cares about this anyway -- for the others,
the parallelisation gains will be substantial, I'm sure.

> > > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001
> > > From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> > > Date: Tue, 3 Dec 2019 17:13:40 +1300
> > > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation.
> > >
> > > Previously a Relation was required, but it's annoying to have
> > > to create a "fake" one in recovery.

> While staring at this, I decided that SharedPrefetchBuffer() was a
> weird word order, so I changed it to PrefetchSharedBuffer(). Then, by
> analogy, I figured I should also change the pre-existing function
> LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is
> an improvement?

Looks good. I doubt you'll break anything by renaming that routine.

> > > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001
> > > From: Thomas Munro <tmunro(at)postgresql(dot)org>
> > > Date: Mon, 9 Dec 2019 17:10:17 +1300
> > > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr().
> > >
> > > The new name better reflects the fact that the value it returns
> > > is updated only when received data has been flushed to disk.
> > >
> > > An upcoming patch will make use of the latest data that was
> > > written without waiting for it to be flushed, so use more
> > > precise function names.
> >
> > Ugh. (Not for your patch -- I mean for the existing naming convention).
> > It would make sense to rename WalRcvData->receivedUpto in this commit,
> > maybe to flushedUpto.
>
> Ok, I renamed that variable and a related one. There are more things
> you could rename if you pull on that thread some more, including
> pg_stat_wal_receiver's received_lsn column, but I didn't do that in
> this patch.

+1 for that approach. Maybe we'll want to rename the SQL-visible name,
but I wouldn't burden this patch with that, lest we lose the entire
series to that :-)

> > > + pg_atomic_uint64 writtenUpto;
> >
> > Are we already using uint64s for XLogRecPtrs anywhere? This seems
> > novel. Given this, I wonder if the comment near "mutex" needs an
> > update ("except where atomics are used"), or perhaps just move the
> > member to after the line with mutex.
>
> Moved.

LGTM.

> We use [u]int64 in various places in the replication code. Ideally
> I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to
> assume that pg_atomic_uint64 is the right atomic integer width and
> signedness, but here we are. In dsa.h I made a special typedef for
> the atomic version of something else, but that's because the size of
> that thing varied depending on the build, whereas our LSNs are of a
> fixed width that ought to be en... <trails off>.

Let's rewrite Postgres in Rust ...

> > I didn't understand the purpose of inc_counter() as written. Why not
> > just pg_atomic_fetch_add_u64(..., 1)?
>
> I didn't want counters that wrap at ~4 billion, but I did want to be
> able to read and write concurrently without tearing. Instructions
> like "lock xadd" would provide more guarantees that I don't need,
> since only one thread is doing all the writing and there's no ordering
> requirement. It's basically just counter++, but some platforms need a
> spinlock to perform atomic reads and writes of 64-bit-wide numbers, so
> more hoop jumping is required.

Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose
the function could use more commentary on *why* you're doing it that way,
then.
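
For the archives, here is a minimal sketch of how I read that explanation,
i.e. roughly what I'd expect the added comment to say (my reconstruction,
not the patch's code; the only thing it assumes is the existing
pg_atomic_read_u64()/pg_atomic_write_u64() primitives):

#include "port/atomics.h"

/*
 * Single-writer counter increment.  Going through the atomics API means
 * readers never see a torn value, even on platforms that emulate 64-bit
 * atomics with a spinlock, while a plain read-modify-write avoids the
 * cost (and unneeded guarantees) of a locked xadd: there is only one
 * writer and no ordering requirement.
 */
static inline void
inc_counter(pg_atomic_uint64 *counter)
{
    pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
}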

> > > /*
> > > * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
> > > + *
> > > + * In recovery only, this can return false to indicate that a file
> > > + * doesn't exist (presumably it has been dropped by a later WAL
> > > + * record).
> > > */
> > > -void
> > > +bool
> > > smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
> >
> > I think this API, where the behavior of a low-level module changes
> > depending on InRecovery, is confusingly crazy. I'd rather have the
> > callers specifying whether they're OK with a file that doesn't exist.
>
> Hmm. But... md.c has other code like that. It's true that I'm adding
> InRecovery awareness to a function that didn't previously have it, but
> that's just because we previously had no reason to prefetch stuff in
> recovery.

True. I'm uncomfortable about it anyway. I also noticed that
_mdfd_getseg() already has InRecovery-specific behavior flags.
Clearly that ship has sailed. Consider my objection^W comment withdrawn.

> > Umm, I would keep the return values of both these functions in sync.
> > It's really strange that PrefetchBuffer does not return
> > PrefetchBufferResult, don't you think?
>
> Agreed, and changed. I suspect that other users of the main
> PrefetchBuffer() call will eventually want that, to do a better job of
> keeping the request queue full, for example bitmap heap scan and
> (hypothetical) btree scan with prefetch.

LGTM.
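
To spell out why I think that matters: with a result coming back, such a
caller could throttle itself on I/Os actually initiated rather than on
prefetch requests issued. Something like the sketch below, where the
initiated_io field and all the surrounding variables are my assumptions,
not necessarily what the patch defines:

    int         ios_in_flight = 0;              /* illustrative local state */
    BlockNumber next_block = first_block;       /* hypothetical prefetch range */

    /*
     * Sketch: keep issuing prefetches until maintenance_io_concurrency
     * real I/Os are in flight, ignoring requests that were satisfied
     * from shared buffers.
     */
    while (ios_in_flight < maintenance_io_concurrency &&
           next_block < last_block)
    {
        PrefetchBufferResult r = PrefetchBuffer(rel, MAIN_FORKNUM, next_block++);

        if (r.initiated_io)
            ios_in_flight++;
    }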

As before, I didn't get to reading 0005 in depth.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
