Re: effective_io_concurrency and NVMe devices

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: effective_io_concurrency and NVMe devices
Date: 2022-06-08 08:59:38
Message-ID: 467b5a20-2ec1-dcca-e09d-18cc475e00c0@enterprisedb.com
Lists: pgsql-hackers

On 6/8/22 08:29, Jakub Wartak wrote:
>>>> The attached patch is a trivial version that waits until we're at
>>>> least 32 pages behind the target, and then prefetches all of them.
>>>> Maybe give it a try?
>>>> (This pretty much disables prefetching for e_i_c below 32, but for an
>>>> experimental patch that's enough.)
>>>
>>> I've tried it initially at e_i_c=10 on David's setup.sql, with mostly
>>> default settings (s_b=128MB, dbsize=8kb) but with parallel query forcibly
>>> disabled (for easier inspection with strace, just to be sure, so please
>>> don't compare times).
>>>
>>> run:
>>> a) master (e_i_c=10): 181760ms, 185680ms, 185384ms @ ~340MB/s and 44k
>>> IOPS (~122k IOPS practical max here for libaio)
>>> b) patched (e_i_c=10): 237774ms, 236326ms, ... as you stated, it disabled
>>> prefetching (fadvise() not occurring)
>>> c) patched (e_i_c=128): 90430ms, 88354ms, 85446ms, 78475ms, 74983ms,
>>> 81432ms (mean=83186ms +/- 5947ms) @ ~570MB/s and 75k IOPS (it even
>>> peaked for a second at ~122k)
>>> d) master (e_i_c=128): 116865ms, 101178ms, 89529ms, 95024ms, 89942ms,
>>> 99939ms (mean=98746ms +/- 10118ms) @ ~510MB/s and 65k IOPS (rare peaks
>>> to 90..100k IOPS)
>>>
>>> ~16% benefit sounds good (help me understand: L1i cache?). Maybe it is
>>> worth throwing that patch onto a more advanced / complete performance
>>> test farm too? (although it's only for bitmap heap scans)
>
> I hope you have some future plans for this patch :)
>
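
As a quick cross-check of runs c) and d) above, the sketch below recomputes the means, the spreads and the relative improvement. It assumes the "+/-" figures are sample standard deviations (n-1 denominator) truncated to whole milliseconds, which is what these inputs reproduce; only Python's standard `statistics` module is used.

```python
import statistics

# Runtimes in ms, copied from runs c) and d) above (e_i_c=128).
patched = [90430, 88354, 85446, 78475, 74983, 81432]
master = [116865, 101178, 89529, 95024, 89942, 99939]


def summarize(runs):
    """Return (mean, sample standard deviation), both truncated to whole ms."""
    return int(statistics.mean(runs)), int(statistics.stdev(runs))


p_mean, p_sd = summarize(patched)
m_mean, m_sd = summarize(master)
# Relative reduction in runtime, computed from the untruncated means.
reduction = 1 - statistics.mean(patched) / statistics.mean(master)

print(f"patched: {p_mean}ms +/- {p_sd}ms")
print(f"master:  {m_mean}ms +/- {m_sd}ms")
print(f"runtime reduction: {reduction:.1%}")
```

The reduction comes out at roughly 15.8%, consistent with the "~16% benefit" mentioned in the quoted message.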

I think the big challenge is to make this adaptive, i.e. to work well for
access patterns that are not known in advance. The existing prefetching
works fine for random access patterns (even on NVMe devices), but not so
well for sequential ones (as demonstrated by David's example).
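
To make the batching rule from the quoted patch description concrete, here is a small simulation: do nothing until the prefetch position is at least 32 pages behind the target, then prefetch everything up to the target at once. This is a hypothetical model, not the patch itself; `plan_prefetches` and `BATCH` are illustrative names, and it assumes the prefetch position never falls behind the block currently being read (already-read blocks are never prefetched). Under that assumption the gap can never exceed the prefetch distance, which is why the rule effectively disables prefetching for e_i_c below 32.

```python
from typing import List

BATCH = 32  # hypothetical threshold, matching the "32 pages" in the patch description


def plan_prefetches(nblocks: int, distance: int, batch: int = BATCH) -> List[List[int]]:
    """Simulate a scan of blocks 0..nblocks-1 and return the prefetch batches.

    `distance` plays the role of the prefetch target (think e_i_c). Blocks are
    only prefetched once the gap between the prefetch position and the target
    reaches `batch`; then the whole gap is prefetched in one go.
    """
    batches: List[List[int]] = []
    next_to_prefetch = 0  # first block that has not been prefetched yet
    for current in range(nblocks):
        # Assumption: blocks at or before the one being read are never prefetched.
        next_to_prefetch = max(next_to_prefetch, current + 1)
        # We would like blocks up to (but excluding) `target` to be prefetched.
        target = min(current + 1 + distance, nblocks)
        if target - next_to_prefetch >= batch:
            batches.append(list(range(next_to_prefetch, target)))
            next_to_prefetch = target
        # ... block `current` would be read here ...
    return batches


if __name__ == "__main__":
    for eic in (10, 31, 32, 128):
        b = plan_prefetches(1000, eic)
        sizes = sorted({len(x) for x in b})
        print(f"distance={eic}: {len(b)} batches, batch sizes {sizes}")
```

With a distance of 10 or 31 no batch is ever issued; with 128 there is one initial batch of 128 blocks followed by steady batches of 32.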

>> Yes, the kernel certainly does its own read-ahead, which works pretty well for
>> sequential patterns. What does
>>
>> blockdev --getra /dev/...
>>
>> say?
>
> It's the default, 256 sectors (128kB), so it matches.
>
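
`blockdev --getra` reports the read-ahead window in 512-byte sectors, so 256 sectors is 256 x 512 B = 128 KiB, or 16 PostgreSQL pages at the default 8 kB block size. On Linux the same setting is also exposed in KiB as `/sys/block/<dev>/queue/read_ahead_kb`. The sketch below does the conversion; the function names are illustrative, and the sysfs reader only works on Linux with a real device name (e.g. "nvme0n1", used here purely as an example).

```python
from pathlib import Path

SECTOR_BYTES = 512     # unit used by `blockdev --getra` / `--setra`
PG_BLOCK_BYTES = 8192  # PostgreSQL's default block size (8 kB)


def getra_to_kib(sectors: int) -> int:
    """Convert a `blockdev --getra` value (512-byte sectors) to KiB."""
    return sectors * SECTOR_BYTES // 1024


def getra_to_pg_pages(sectors: int) -> int:
    """Number of default-sized PostgreSQL pages covered by the read-ahead window."""
    return sectors * SECTOR_BYTES // PG_BLOCK_BYTES


def read_ahead_kb(device: str) -> int:
    """Read the same setting from sysfs (Linux only); `device` is e.g. "nvme0n1"."""
    return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())


if __name__ == "__main__":
    print(getra_to_kib(256), "KiB =", getra_to_pg_pages(256), "pages of 8 kB")
```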

Right. I think this is pretty much why (our) prefetching performs so
poorly on sequential access patterns: the kernel read-ahead works very
well in this case, so our prefetching can't help there, but it can interfere.
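
One conceivable way to act on this, purely as a hypothetical sketch (it is not what PostgreSQL or the patch above does), is to track whether recent block requests form a sequential run and skip explicit prefetch hints while they do, leaving such runs to kernel read-ahead. `SequentialDetector`, `hinted_blocks`, the window of 16 and the threshold of 12 are all illustrative assumptions. In a real implementation the hint would be a `posix_fadvise(..., POSIX_FADV_WILLNEED)` call, which is what the fadvise() calls mentioned in the benchmark notes refer to.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


class SequentialDetector:
    """Hypothetical heuristic: suppress explicit prefetch hints during sequential runs.

    A request counts as sequential if it is for the block right after the
    previous request. If at least `threshold` of the last `window` requests
    were sequential, we assume kernel read-ahead already covers them.
    """

    def __init__(self, window: int = 16, threshold: int = 12) -> None:
        self.recent: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.last_block: Optional[int] = None

    def should_prefetch(self, block: int) -> bool:
        """Record a request for `block`; return True if a hint looks worthwhile."""
        sequential = self.last_block is not None and block == self.last_block + 1
        self.recent.append(sequential)
        self.last_block = block
        # A real implementation would call posix_fadvise(..., POSIX_FADV_WILLNEED)
        # only when this returns True.
        return sum(self.recent) < self.threshold


def hinted_blocks(blocks: Iterable[int], window: int = 16, threshold: int = 12) -> List[int]:
    """Return the blocks for which the detector would still issue a hint."""
    detector = SequentialDetector(window, threshold)
    return [b for b in blocks if detector.should_prefetch(b)]


if __name__ == "__main__":
    print(hinted_blocks(range(100)))          # sequential scan
    print(hinted_blocks([7, 300, 12, 9000]))  # random access
```

On a purely sequential scan only the first 12 requests get a hint, while on random access every request does. The fixed window also means the detector reacts a few requests late whenever the pattern changes, which illustrates why making prefetching adaptive is the hard part.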

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
