Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-03-17 23:00:04
Message-ID: 28c3352e-5388-6410-aae7-05907079e6c9@enterprisedb.com
Lists: pgsql-hackers

Hi,

On 3/17/21 10:43 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)enterprisedb(dot)com) wrote:
>> Right, I was just going to point out the FPIs are not necessary - what
>> matters is the presence of long streaks of WAL records touching the same
>> set of blocks. But people with workloads where this is common likely
>> don't need the WAL prefetching at all - the replica can keep up just
>> fine, because it doesn't need to do much I/O anyway (and if it can't
>> then prefetching won't help much anyway). So just don't enable the
>> prefetching, and there'll be no overhead.
>
> Isn't this exactly the common case though..? Checkpoints happening
> every 5 minutes, the replay of the FPI happens first and then the record
> is updated and everything's in SB for the later changes?

Well, as I said before, the FPIs are not very significant - you'll have
mostly the same issue with any repeated changes to the same block. It
does not matter much if you do

FPI for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

or just

WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

In both cases some of the prefetches are probably unnecessary. But the
frequency of checkpoints does not really matter; the important bit is
repeated changes to the same block(s).
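
To make the redundancy concrete, here is a minimal sketch (in C, purely
illustrative, not code from the patch) of what a look-ahead prefetcher
conceptually does for each block reference it decodes from upcoming WAL -
the file descriptor and block number are assumed to be already resolved:

#include <fcntl.h>

#define BLCKSZ 8192            /* PostgreSQL heap page size */

/* Hypothetical helper, not from the patch: hint the kernel to read the
 * page backing block "blkno" of an already-open relation file. If the
 * same block shows up again a few WAL records later, the same hint is
 * issued again - that repeated call is the "unnecessary" prefetch
 * discussed above: cheap, but not free. */
static void
prefetch_block(int fd, unsigned int blkno)
{
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}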

If you have an active set much larger than RAM, this is quite unlikely. And
we know from the pgbench tests that prefetching has a huge positive
effect in this case.

On smaller active sets, with frequent updates to the same block, we may
issue unnecessary prefetches - that's true. But (a) you have not shown
any numbers suggesting this is actually an issue, and (b) those cases
don't really need prefetching because all the data is already either in
shared buffers or in page cache. So if it happens to be an issue, the
user can simply disable it.

So what exactly would a problematic workload look like?

> You mentioned elsewhere that this would improve 80% of cases but that
> doesn't seem to be backed up by anything and certainly doesn't seem
> likely to be the case if we're talking about across all PG
> deployments.

Obviously, the 80% was just a figure of speech, illustrating my belief
that the proposed patch is beneficial for most users who currently have
issues with replication lag. That is based on my experience with support
customers who have such issues - it's almost invariably an OLTP workload
with a large active set, and we know (from the benchmarks) that in these
cases it helps.

Users who don't have issues with replication lag can disable (or not
enable) the prefetching, and won't get any negative effects.

Perhaps there are users with weird workloads that have replication lag
issues but this patch won't help them - bummer, we can't solve
everything in one go. Also, no one actually demonstrated such workload
in this thread so far.

But since you're suggesting we don't have data to support the claim
that this actually helps many users (with no risk to others), I'd
point out that you have not provided any numbers showing that it
actually is an issue in practice.

> I also disagree that asking the kernel to go do random I/O for us,
> even as a prefetch, is entirely free simply because we won't
> actually need those pages. At the least, it potentially pushes out
> pages that we might need shortly from the filesystem cache, no?

Where exactly did I say it's free? I said that workloads where this
happens a lot most likely don't need the prefetching at all, so it can
be simply disabled, eliminating all negative effects.

Moreover, looking at a limited number of recently prefetched blocks
won't eliminate this problem anyway - imagine random OLTP on a large
data set that nevertheless fits into RAM. After a while no read I/O
needs to be done, but you'd need a pretty much infinite list of
prefetched blocks to eliminate that, and with smaller lists you'll
still do 99% of the prefetches.
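
To illustrate why, here is a minimal sketch (purely hypothetical, not
from the patch) of such a "recently prefetched" filter - with random
access to, say, a million distinct blocks, a 64-entry window hits so
rarely that it filters out almost nothing:

#include <stdbool.h>

#define PREFETCH_WINDOW 64      /* made-up size of the "recent" list */

static unsigned int recent_blocks[PREFETCH_WINDOW];
static int          recent_pos = 0;

/* Return true if blkno was prefetched within the last PREFETCH_WINDOW
 * calls; otherwise remember it. With N distinct blocks accessed at
 * random, the hit rate is roughly PREFETCH_WINDOW / N - for N = 1M
 * that's about 0.006%, so ~99% of the prefetches are still issued. */
static bool
recently_prefetched(unsigned int blkno)
{
    for (int i = 0; i < PREFETCH_WINDOW; i++)
        if (recent_blocks[i] == blkno)
            return true;

    recent_blocks[recent_pos] = blkno;
    recent_pos = (recent_pos + 1) % PREFETCH_WINDOW;
    return false;
}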

Just disabling prefetching on such instances seems quite reasonable.

>> If it was up to me, I'd just get the patch committed as is. Delaying the
>> feature because of concerns that it might have some negative effect in
>> some cases, when that can be simply mitigated by disabling the feature,
>> is not really beneficial for our users.
>
> I don't know that we actually know how many cases it might have a
> negative effect on or what the actual amount of such negative case there
> might be- that's really why we should probably try to actually benchmark
> it and get real numbers behind it, particularly when the chances of
> running into such a negative effect with the default configuration (that
> is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more
> likely to occur in the field than the cases where FPWs are disabled and
> someone's running on ZFS.
>
> Perhaps more to the point, it'd be nice to see how this change actually
> improves the cases where PG is running with more-or-less the defaults on
> the more commonly deployed filesystems. If it doesn't then maybe it
> shouldn't be the default..? Surely the folks running on ZFS and running
> with FPWs disabled would be able to manage to enable it if they
> wished to and we could avoid entirely the question of if this has a
> negative impact on the more common cases.
>
> Guess I'm just not a fan of pushing out a change that will impact
> everyone by default, in a possibly negative way (or positive, though
> that doesn't seem terribly likely, but who knows), without actually
> measuring what that impact will look like in those more common cases.
> Showing that it's a great win when you're on ZFS or running with FPWs
> disabled is good and the expected best case, but we should be
> considering the worst case too when it comes to performance
> improvements.
>

Well, maybe it'll behave differently on systems with ZFS. I don't know,
and I have no such machine to test that at the moment. My argument
however remains the same - if it happens to be a problem, just don't
enable (or disable) the prefetching, and you get the current behavior.

FWIW I'm not sure there was a discussion or argument about what should
be the default setting (enabled or disabled). I'm fine with not enabling
this by default, so that people have to enable it explicitly.

In a way that'd be consistent with effective_io_concurrency being 1 by
default, which almost disables regular prefetching.

> Anyhow, ultimately I don't know that there's much more to discuss on
> this thread with regard to this particular topic, at least. As I said
> before, if everyone else is on board and not worried about it then so be
> it; I feel that at least the concern that I raised has been heard.
>

OK, thanks for the discussions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
