Re: WAL prefetch

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Sean Chittenden <seanc(at)joyent(dot)com>
Subject: Re: WAL prefetch
Date: 2018-06-19 16:34:22
Message-ID: baa76c0f-ac18-851f-8181-316629fc7ee4@postgrespro.ru
Lists: pgsql-hackers

On 19.06.2018 18:50, Andres Freund wrote:
> On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
>> I do not think that prefetching into shared buffers requires much more effort
>> or makes the patch more invasive...
>> It even somewhat simplifies it, because there is no need to maintain our own
>> cache of prefetched pages...
>> But it will definitely have much more impact on Postgres performance:
>> contention for buffer locks, throwing away pages accessed by read-only
>> queries,...
> These arguments seem bogus to me. Otherwise the startup process is going
> to do that work.

There is just one process replaying WAL. Certainly it has some impact on
hot standby query execution.
But if there are several prefetch workers (128?), then this impact will
increase dramatically.

>
>> Also there are two points which makes prefetching into shared buffers more
>> complex:
>> 1. Need to spawn multiple workers to perform prefetch in parallel and somehow
>> distribute the work between them.
> I'm not even convinced that's true. It doesn't seem insane to have a
> queue of, say, 128 requests that are done with posix_fadvise WILLNEED,
> where the oldest request is read into shared buffers by the
> prefetcher. And then discarded from the page cache with WONTNEED. I
> think we're going to want a queue that's sorted in the prefetch process
> anyway, because there's a high likelihood that we'll otherwise issue
> prefetch requests for the same pages over and over again.
>
> That gets rid of most of the disadvantages: We have backpressure
> (because the read into shared buffers will block if not yet ready),
> we'll prevent double buffering, we'll prevent the startup process from
> doing the victim buffer search.
>
>
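That scheme is interesting. To check that I understand it, here is a minimal
standalone sketch (names and sizes are my illustrative assumptions, not code
from any actual patch): it uses plain file descriptors instead of the real
smgr/buffer manager calls, and it leaves out the sorting/deduplication of
requests that you mention:

#define _XOPEN_SOURCE 600       /* for pread() and posix_fadvise() */

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ      8192        /* PostgreSQL block size */
#define QUEUE_DEPTH 128         /* "a queue of, say, 128 requests" */

typedef struct PrefetchRequest
{
    int   fd;                   /* relation segment file descriptor */
    off_t offset;               /* block-aligned offset within the segment */
} PrefetchRequest;

static PrefetchRequest queue[QUEUE_DEPTH];
static int queue_head = 0;      /* index of the oldest request */
static int queue_len = 0;

static void
prefetch_block(int fd, off_t offset)
{
    if (queue_len == QUEUE_DEPTH)
    {
        /* Queue full: consume the oldest request first (backpressure). */
        PrefetchRequest *req = &queue[queue_head];
        char buf[BLCKSZ];

        /* Stands in for the read into shared buffers. */
        (void) pread(req->fd, buf, BLCKSZ, req->offset);
        /* Prevent double buffering: drop the page from the OS cache. */
        (void) posix_fadvise(req->fd, req->offset, BLCKSZ,
                             POSIX_FADV_DONTNEED);

        queue_head = (queue_head + 1) % QUEUE_DEPTH;
        queue_len--;
    }

    /* Ask the kernel to start reading the new block in the background. */
    (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
    queue[(queue_head + queue_len) % QUEUE_DEPTH] =
        (PrefetchRequest) {fd, offset};
    queue_len++;
}
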
>> Concerning WAL prefetch, I still have serious doubts about whether it is
>> needed at all: if the checkpoint interval is less than the amount of free
>> memory in the system, then the redo process should not read much.
> I'm confused. Didn't you propose this? FWIW, there's a significant
> number of installations where people have observed this problem in
> practice.

Well, originally it was proposed by Sean - the author of pg-prefaulter.
I just ported it from Go to C using the standard PostgreSQL WAL iterator.
Then I performed some measurements and didn't find any dramatic improvement
in performance (in the case of synchronous replication) or reduction of
replication lag (for asynchronous replication), either on my desktop (SSD,
16GB RAM, local replication within the same computer, pgbench scale 1000)
or on a pair of powerful servers connected by InfiniBand with 3TB of NVMe
storage (pgbench with scale 100000).
I also noticed that the read rate at the replica is almost zero.
This may mean that:
1. I am doing something wrong.
2. posix_fadvise is not so efficient.
3. pgbench is not the right workload to demonstrate the effect of prefetch.
4. The hardware I am using is not typical.

So it makes me think about when such prefetch may be needed... And it raises
new questions:
How frequently is the checkpoint interval much larger than the OS cache?
If we enforce full page writes (let's say after each 1GB of WAL), how does
that affect WAL size and performance?

It looks like it is difficult to answer the second question without
implementing some prototype.
Maybe I will try to do that.
>> And if the checkpoint interval is much larger than the OS cache (are there
>> cases when that is really needed?)
> Yes, there are. The percentage of FPWs can cause serious problems, as do
> repeated writeouts by the checkpointer.

One more consideration: data is written to disk as blocks in any case. If
you update just a few bytes on a page, the whole page still has to be
written to the database file.
So avoiding full page writes reduces the WAL size and the amount of data
written to the WAL, but not the amount of data written to the database
itself.
It means that if we completely eliminate FPWs and transactions update
random pages, then disk traffic is reduced by less than a factor of two...
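For example (just to illustrate the arithmetic, with made-up record sizes):
with 8kB pages, a small update with an FPW writes roughly 8kB to the WAL
plus 8kB to the data file, about 16kB in total; without an FPW it writes,
say, ~100 bytes of WAL plus the same 8kB data file write, about 8kB in
total. So total write traffic shrinks by at most about 2x, even though the
WAL itself shrinks by much more.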

>
>
>> then a quite small patch (as it seems to me now) forcing a full page write
>> when the distance between the page LSN and the current WAL insertion point
>> exceeds some threshold should eliminate random reads in this case as well.
> I'm pretty sure that that'll hurt a significant number of installations
> that set the timeout high just so they can avoid FPWs.

Maybe, but I am not so sure. That is why I will try to investigate it more.
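
To make the idea concrete, the check I have in mind is roughly the following
(a minimal sketch with hypothetical names and threshold, not actual
PostgreSQL code):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* byte position in the WAL stream */

/* Illustrative threshold: force a full page image once the page's LSN
 * lags the current insert position by more than ~1GB of WAL. */
#define FPW_DISTANCE_THRESHOLD ((uint64_t) 1024 * 1024 * 1024)

static bool
NeedForcedFullPageWrite(XLogRecPtr page_lsn, XLogRecPtr insert_ptr)
{
    /* If the page was last WAL-logged too far back, log it in full. */
    return insert_ptr - page_lsn > FPW_DISTANCE_THRESHOLD;
}

With such a rule, redo on the standby would always find a full image of any
page it touches within the last ~1GB of WAL, so it should not need random
reads of old page versions at all.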

> Greetings,
>
> Andres Freund

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
