Quick Links

Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From:	Cedric Villemain <Cedric(dot)Villemain+pgsql(at)abcSQL(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Date:	2024-01-03 23:23:43
Message-ID:	2f91cf01-3ce6-45c6-aa9d-72e4953264df@abcSQL.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

I wonder what you think of making pg_prewarm use recent addition on
smgrprefetch and readv ?

In order to try, I did it anyway in the attached patches. They contain
no doc update, but I will proceed if it is of interest.

In summary:

1. The first one adds a new check on parameters (checking last block is
indeed not before first block).
Consequence is an ERROR is raised instead of silently doing nothing.

2. The second one does implement smgrprefetch with range and loops by
default per segment to still have a check for interrupts.

3. The third one provides smgrreadv instead of smgrread, by default on
a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Q: posix_fadvise may not work exactly the way you think it does, or does
it ?

In details, and for the question:

It's not so obvious that the "feature" is really required or wanted,
depending on what are the expectations from user point of view.

The kernel decides on what to do with posix_fadvise calls, and how we
pass parameters does impact the decision.
With the current situation where prefetch is done step by step, block by
block, they are very probably most of the time all loaded even if those
from the beginning of the relation can be discarded at the end of the
prefetch.

However, if instead you provide a real range, or the magic len=0 to
posix_fadvise, then blocks are "more" loaded according to effective vm
pressure (which is not the case on the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end-users expect despite being probably a good
choice (you can still free cache beforehand to help the kernel).

An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |        0
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |      320
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |      320 <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |    65536
      32768 |       32768 |    65536 |    65536
      65536 |       32768 |    65536 |    65536
      98304 |       32768 |    65536 |    65536
     131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work,
here each call is from block 0 with len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

Thank you for your time reading this longer than expected email.

Comments ?

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

Attachment	Content-Type	Size
0001-Check-last-block-greater-or-equal-to-first-block-in-.patch	text/x-patch	1.1 KB
0002-Allow-pg_prewarm-to-use-new-smgrprefetch-range.patch	text/x-patch	1.9 KB
0003-Allow-pg_prewarm-to-use-new-smgrreadv.patch	text/x-patch	2.0 KB

Responses

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED at 2024-03-05 11:07:01 from Nazir Bilal Yavuz

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Matthias van de Meent	2024-01-03 23:23:50	Re: Reducing output size of nodeToString
Previous Message	Jim Nasby	2024-01-03 23:15:11	Re: Extension Enhancement: Buffer Invalidation in pg_buffercache