Re: WAL prefetch

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Sean Chittenden <seanc(at)joyent(dot)com>
Subject: Re: WAL prefetch
Date: 2018-06-19 13:03:59
Message-ID: 75102f8c-3659-2c74-c9fc-8fbf70d5b525@2ndquadrant.com
Lists: pgsql-hackers

On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote:
>
> On 19.06.2018 14:03, Tomas Vondra wrote:
>>
>> On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:
>>>
>>> ...
>>>
>>> Also there are two points which make prefetching into shared buffers
>>> more complex:
>>> 1. We need to spawn multiple workers to perform prefetch in parallel
>>> and somehow distribute the work between them.
>>> 2. We need to synchronize the recovery process with the prefetch
>>> workers, to prevent prefetch from going too far ahead and doing
>>> useless work.
>>> The same problem exists for prefetch into the OS cache, but there the
>>> risk of a false prefetch is less critical.
>>>
>>
>> I think the main challenge here is that all buffer reads are currently
>> synchronous (correct me if I'm wrong), while the posix_fadvise()
>> allows us to prefetch the buffers asynchronously.
>
> Yes, this is why we have to spawn several concurrent background workers
> to perform prefetch.

Right. My point is that while spawning bgworkers probably helps, I don't
expect it to be enough to fill the I/O queues on modern storage systems.
Even if you start, say, 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more than 16
requests in the queue.

Consider, for example, [1] from 2014, where Merlin reported how an S3500
(an Intel SATA SSD) behaves with different effective_io_concurrency values:

[1]
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider that you may have
multiple such devices in a single RAID array, and that this device is
from 2014 (and newer flash devices likely need even deeper queues).
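
To illustrate what I mean by keeping the queue full: below is a rough
sketch (illustration only, not code from any of the patches discussed
here) of a single process keeping ~64 requests in flight by issuing
POSIX_FADV_WILLNEED hints ahead of the blocks it actually reads. The fd,
block size and queue depth are just assumptions for the example:

/*
 * Rough sketch (illustration only): a single backend keeps ~64 prefetch
 * requests queued by hinting blocks ahead of the position it reads from.
 * The fd, BLCKSZ and PREFETCH_DEPTH are assumptions for the example.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ          8192
#define PREFETCH_DEPTH    64    /* cf. effective_io_concurrency */

static int
read_with_prefetch(int fd, long nblocks)
{
    char    buf[BLCKSZ];
    long    hinted = 0;

    for (long blkno = 0; blkno < nblocks; blkno++)
    {
        /* keep up to PREFETCH_DEPTH hints ahead of the synchronous reads */
        while (hinted < nblocks && hinted < blkno + PREFETCH_DEPTH)
        {
            (void) posix_fadvise(fd, hinted * (off_t) BLCKSZ,
                                 BLCKSZ, POSIX_FADV_WILLNEED);
            hinted++;
        }

        /* the actual read; ideally the block is already in the page cache */
        if (pread(fd, buf, BLCKSZ, blkno * (off_t) BLCKSZ) != BLCKSZ)
            return -1;
    }
    return 0;
}

Nothing in that loop blocks except the pread() itself, so one process can
keep the device queue far deeper than one request per bgworker.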

ISTM a small number of bgworkers is not going to be sufficient. It might
be enough for WAL prefetching (where we may easily run into the
redo-is-single-threaded bottleneck), but it's hardly a solution for
bitmap heap scans, for example. We'll need to invent something else for
that.

OTOH my guess is that whatever solution we end up implementing for bitmap
heap scans will be applicable to WAL prefetching too, which is why I'm
suggesting that simply using posix_fadvise is not going to make the
direct I/O patch significantly more complicated.
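
And just to sketch what that might look like on the WAL side (again,
only an illustration, not code from any existing patch): a prefetcher
could decode records ahead of the current replay position with an
XLogReader and hint every block they reference. The smgr_fd_and_offset()
helper, mapping (RelFileNode, fork, block) to an fd and offset, is
hypothetical:

/*
 * Rough sketch of a fadvise-based WAL prefetcher (illustration only).
 * smgr_fd_and_offset() is a hypothetical helper mapping a
 * (RelFileNode, fork, block) to an open fd and a file offset.
 */
static void
prefetch_wal_ahead(XLogReaderState *reader, XLogRecPtr start_ptr,
                   int prefetch_distance)
{
    XLogRecPtr  ptr = start_ptr;
    char       *errmsg;

    for (int n = 0; n < prefetch_distance; n++)
    {
        XLogRecord *record = XLogReadRecord(reader, ptr, &errmsg);

        ptr = InvalidXLogRecPtr;        /* continue from the reader's position */

        if (record == NULL)
            break;                      /* nothing more decoded yet */

        for (int block_id = 0; block_id <= reader->max_block_id; block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;
            int         fd;
            off_t       off;

            if (!XLogRecGetBlockTag(reader, block_id, &rnode, &forknum, &blkno))
                continue;

            if (smgr_fd_and_offset(rnode, forknum, blkno, &fd, &off))
                (void) posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_WILLNEED);
        }
    }
}

The fadvise calls don't block, so a single prefetcher (whether that's the
startup process itself or a bgworker) can stay well ahead of redo without
needing a pool of workers doing synchronous reads.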

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
