Re: WAL prefetch

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Sean Chittenden <seanc(at)joyent(dot)com>
Subject: Re: WAL prefetch
Date: 2018-06-15 07:38:56
Message-ID: 6e47f1fe-5bd3-0190-c3c1-69b8a291dd26@postgrespro.ru
Lists: pgsql-hackers

On 15.06.2018 07:36, Amit Kapila wrote:
> On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>> I'm also surprised that it wasn't a larger improvement.
>>
>> Seems like it would make sense to implement in core using
>> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
>> or nearby.. At least, that's the thinking I had when I was chatting w/
>> Sean.
>>
> Doing in-core certainly has some advantage such as it can easily reuse
> the existing xlog code rather than trying to make a copy as is currently
> done in the patch, but I think it also depends on whether this is
> really a win in a number of common cases or is it just a win in some
> limited cases.
>
I completely agree. That was my main concern: for which use cases will
this prefetch be efficient?
If "full_page_writes" is on (and it is safe and default value), then
first update of a page since last checkpoint will be written in WAL as
full page and applying it will not require reading any data from disk.
If this pages is updated multiple times in subsequent transactions, then
most likely it will be still present in OS file cache, unless checkpoint
interval exceeds OS cache size (amount of free memory in the system). So
if this conditions are satisfied then looks like prefetch is not needed.
And it seems to be true for most real configurations: checkpoint
interval is rarely set larger than hundred of gigabytes and modern
servers usually have more RAM.

But once this condition is not satisfied and the replication lag is
larger than the OS cache, prefetch can become inefficient, because
prefetched pages may be evicted from the OS cache before they are
actually accessed by the redo process. In that case extra
synchronization between the prefetch and replay processes is needed, so
that prefetch does not move too far ahead of the replayed LSN.
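
In core, such throttling could be as simple as polling the replayed LSN
before prefetching further. A minimal sketch (the prefetch worker and
its prefetch_lsn/prefetch_limit variables are hypothetical; only
GetXLogReplayRecPtr() and pg_usleep() are the existing server API):

#include "postgres.h"
#include "access/xlog.h"        /* GetXLogReplayRecPtr() */
#include "miscadmin.h"          /* pg_usleep() */

static XLogRecPtr prefetch_lsn;   /* next position we intend to prefetch */
static uint64 prefetch_limit;     /* max allowed lead over replay, bytes */

static void
throttle_prefetch(void)
{
    for (;;)
    {
        XLogRecPtr replayed = GetXLogReplayRecPtr(NULL);

        /* let prefetch run at most prefetch_limit bytes ahead of redo */
        if (prefetch_lsn <= replayed ||
            prefetch_lsn - replayed < prefetch_limit)
            break;

        /* too far ahead: wait for redo to catch up, so prefetched
         * pages are not evicted before they are actually used */
        pg_usleep(10000L);      /* 10 ms */
    }
}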

It would not be a problem to integrate this code into Postgres core and
run it in a background worker. I do not think that performing the
prefetch in the walreceiver process itself is a good idea: it may slow
down the receiving of changes from the master. In core I really could
throw away the cut&pasted xlog code, but it is easier to experiment with
an extension than with a patch to Postgres core. I have published this
extension precisely to make such experiments possible and to check
whether it is useful on real workloads.
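
For what it is worth, the prefetch itself boils down to one
posix_fadvise() call per data block referenced by a WAL record. A
minimal sketch (prefetch_block() is a hypothetical helper; looking up
the open file descriptor for the relation segment is elided):

#include <fcntl.h>

#define BLCKSZ 8192                     /* PostgreSQL page size */

/* fd: already-open descriptor for the relation segment file;
 * blkno: block number within that segment. */
static int
prefetch_block(int fd, unsigned int blkno)
{
    /* Ask the kernel to start reading the page into the OS cache,
     * so the redo process later finds it there and does not stall. */
    return posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}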

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
