Re: WIP: WAL prefetch (another approach)

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2020-04-25 19:19:35
Message-ID: 20200425191935.4p6qxzzu5bklodwk@localhost

> On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote:
>
> One report I heard recently said that if you get rid of I/O stalls,
> pread() becomes cheap enough that the much higher frequency lseek()
> calls I've complained about elsewhere[1] become the main thing
> recovery is doing, at least on some systems, but I haven't pieced
> together the conditions required yet. I'd be interested to know if
> you see that.

So far I've performed a couple of replication tests in a case where almost
everything is in memory (mostly by mistake: I was expecting that a postgres
replica within a badly memory-limited cgroup would cause more IO, but it
looks like the kernel does not evict the pages anyway). I'm not sure if
that's what you mean by getting rid of IO stalls, but in these tests
profiling shows lseek & pread in a similar number of samples.

If I understand correctly, one can eventually measure the influence of
prefetching by looking at the execution time of the various redo functions
(assuming the data they operate on is already prefetched, they should be
faster). I still have to clarify the exact reason, but even in the
in-memory situation described above there is some visible difference, e.g.

# with prefetch
Function = b'heap2_redo' [8064]
     nsecs               : count     distribution
      4096 -> 8191       : 1213     |                                        |
      8192 -> 16383      : 66639    |****************************************|
     16384 -> 32767      : 27846    |****************                        |
     32768 -> 65535      : 873      |                                        |

# without prefetch
Function = b'heap2_redo' [17980]
     nsecs               : count     distribution
      4096 -> 8191       : 1        |                                        |
      8192 -> 16383      : 66997    |****************************************|
     16384 -> 32767      : 30966    |******************                      |
     32768 -> 65535      : 1602     |                                        |

# with prefetch
Function = b'btree_redo' [8064]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 246      |****************************************|
      8192 -> 16383      : 5        |                                        |
     16384 -> 32767      : 2        |                                        |

# without prefetch
Function = b'btree_redo' [17980]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 82       |********************                    |
      8192 -> 16383      : 19       |****                                    |
     16384 -> 32767      : 160      |****************************************|

Of course this doesn't take into account the time we spend doing extra
syscalls for prefetching, but it can still give some interesting
information.
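
For comparison, timing the redo calls directly inside the process would
look roughly like the sketch below. It is purely illustrative and not part
of any patch; it assumes the usual PostgreSQL headers (postgres.h,
access/xlog_internal.h) are already included.

#include <time.h>

/*
 * Illustrative only: time a single rm_redo call, which is roughly what the
 * per-function latency histograms above measure.
 */
static int64
timed_rm_redo(XLogReaderState *xlogreader)
{
    struct timespec start;
    struct timespec end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    RmgrTable[XLogRecGetRmid(xlogreader)].rm_redo(xlogreader);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* elapsed time in nanoseconds */
    return (int64) (end.tv_sec - start.tv_sec) * 1000000000 +
           (end.tv_nsec - start.tv_nsec);
}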

> > I like the idea of extensible PrefetchBufferResult. Just one commentary,
> > if I understand correctly the way how it is being used together with
> > prefetch_queue assumes one IO operation at a time. This limits potential
> > extension of the underlying code, e.g. one can't implement some sort of
> > buffering of requests and submitting an iovec to a syscall, then
> > prefetch_queue will no longer correctly represent inflight IO. Also,
> > taking into account that "we don't have any awareness of when I/O really
> > completes", maybe in the future it makes sense to reconsider having the queue in
> > the prefetcher itself and rather ask for this information from the
> > underlying code?
>
> Yeah, you're right that it'd be good to be able to do some kind of
> batching up of these requests to reduce system calls. Of course
> posix_fadvise() doesn't support that, but clearly in the AIO future[2]
> it would indeed make sense to buffer up a few of these and then make a
> single call to io_uring_enter() on Linux[3] or lio_listio() on a
> hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a
> thing like that on Windows; at a glance, ReadFileScatter() is
> asynchronous ("overlapped") but works only on a single handle so it's
> like a hypothetical POSIX aio_readv(), not like POSIX lio_listio()).
>
> Perhaps there could be an extra call PrefetchBufferSubmit() that you'd
> call at appropriate times, but you obviously can't call it too
> infrequently.
>
> As for how to make the prefetch queue a reusable component, rather
> than having a custom thing like that for each part of our system that
> wants to support prefetching: that's a really good question. I didn't
> see how to do it, but maybe I didn't try hard enough. I looked at the
> three users I'm aware of, namely this patch, a btree prefetching patch
> I haven't shared yet, and the existing bitmap heap scan code, and they
> all needed to have their own custom book keeping for this, and I
> couldn't figure out how to share more infrastructure. In the case of
> this patch, you currently need to do LSN based book keeping to
> simulate "completion", and that doesn't make sense for other users.
> Maybe it'll become clearer when we have support for completion
> notification?

Yes, definitely.
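
To illustrate what I meant by buffering of requests, below is a very rough
sketch of a batched submit interface. All of the names and types are made
up and nothing like this exists in the patch; with posix_fadvise() the
batch still degenerates into one syscall per request, but an AIO backend
could hand the whole array to io_uring_enter() or lio_listio() in a single
call.

#include <fcntl.h>
#include <sys/types.h>

#define PREFETCH_BATCH_SIZE 16

typedef struct PrefetchRequest
{
    int         fd;
    off_t       offset;
    off_t       length;
} PrefetchRequest;

static PrefetchRequest prefetch_batch[PREFETCH_BATCH_SIZE];
static int  prefetch_batch_used = 0;

/* Flush all buffered requests; for now, one advisory call each. */
static void
PrefetchBufferSubmit(void)
{
    for (int i = 0; i < prefetch_batch_used; i++)
        (void) posix_fadvise(prefetch_batch[i].fd,
                             prefetch_batch[i].offset,
                             prefetch_batch[i].length,
                             POSIX_FADV_WILLNEED);
    prefetch_batch_used = 0;
}

/* Queue one request, flushing automatically when the batch fills up. */
static void
PrefetchBufferQueue(int fd, off_t offset, off_t length)
{
    if (prefetch_batch_used == PREFETCH_BATCH_SIZE)
        PrefetchBufferSubmit();

    prefetch_batch[prefetch_batch_used].fd = fd;
    prefetch_batch[prefetch_batch_used].offset = offset;
    prefetch_batch[prefetch_batch_used].length = length;
    prefetch_batch_used++;
}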

> Some related questions are why all these parts of our system that know
> how to prefetch are allowed to do so independently without any kind of
> shared accounting, and why we don't give each tablespace (= our model
> of a device?) its own separate queue. I think it's OK to put these
> questions off a bit longer until we have more infrastructure and
> experience. Our current non-answer is at least consistent with our
> lack of an approach to system-wide memory and CPU accounting... I
> personally think that a better XLogReader that can be used for
> prefetching AND recovery would be a higher priority than that.

Sure, this patch is quite valuable as it is, and the questions I've
mentioned mostly target future development.

> > Maybe it was discussed in the past in other threads. But if I understand
> > correctly, this implementation weights all the samples. Since at the
> > moment it depends directly on replaying speed (so a lot of IO involved),
> > couldn't it lead to a single outlier at the beginning skewing this value
> > and make it less useful? Does it make sense to decay old values?
>
> Hmm.
>
> I wondered about reporting one or perhaps three exponential moving
> averages (like Unix 1/5/15 minute load averages), but I didn't propose
> it because: (1) In crash recovery, you can't query it, you just get
> the log message at the end, and mean unweighted seems OK in that case,
> no? (you are not more interested in the I/O saturation at the end of
> the recovery compared to the start of recovery are you?), and (2) on a
> streaming replica, if you want to sample the instantaneous depth and
> compute an exponential moving average or some more exotic statistical
> concoction in your monitoring tool, you're free to do so. I suppose
> (2) is an argument for removing the existing average completely from
> the stat view; I put it in there at Andres's suggestion, but I'm not
> sure I really believe in it. Where is our average replication lag,
> and why don't we compute the stddev of X, Y or Z? I think we should
> provide primary measurements and let people compute derived statistics
> from those.

For once I disagree, since I believe this very approach, widely applied,
leads to a somewhat chaotic monitoring situation. But of course you're
right, that has nothing to do with the patch itself. I would also be in
favour of removing the existing average, unless Andres has more arguments
for keeping it.
