Re: index prefetching

From: "Peter Geoghegan" <pg(at)bowt(dot)ie>
To: "Andres Freund" <andres(at)anarazel(dot)de>
Cc: "Tomas Vondra" <tomas(at)vondra(dot)me>, "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com>, "Nazir Bilal Yavuz" <byavuz81(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Melanie Plageman" <melanieplageman(at)gmail(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Georgios" <gkokolatos(at)protonmail(dot)com>, "Konstantin Knizhnik" <knizhnik(at)garret(dot)ru>, "Dilip Kumar" <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-14 20:12:40
Message-ID: DC2FB4TJBCYR.2U40BNPFOO51E@bowt.ie
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu Aug 14, 2025 at 3:41 PM EDT, Andres Freund wrote:
> Hm, that is somewhat curious.
>
> I wonder if there's some wait time that's not being captured by "I/O
> Timings". A first thing to do would be to just run strace --summary-only while
> running the query, and see if there are syscall wait times that seem too long.

For the slow, sequential backwards scan query:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.271216 4 66808 io_uring_enter
0.00 0.000004 4 1 sendto
0.00 0.000001 0 2 1 recvfrom
0.00 0.000000 0 5 lseek
0.00 0.000000 0 1 epoll_wait
0.00 0.000000 0 4 openat
------ ----------- ----------- --------- --------- ----------------
100.00 0.271221 4 66821 1 total

For the fast, random backwards scan query:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.99 0.351518 4 77819 io_uring_enter
0.00 0.000007 2 3 1 epoll_wait
0.00 0.000006 6 1 sendto
0.00 0.000003 1 3 2 recvfrom
0.00 0.000002 2 1 read
0.00 0.000002 2 1 1 rt_sigreturn
0.00 0.000002 2 1 getpid
0.00 0.000002 1 2 kill
0.00 0.000000 0 3 lseek
------ ----------- ----------- --------- --------- ----------------
100.00 0.351542 4 77834 4 total

> What effective_io_concurrency and io_max_concurrency setting are you using? If
> there are no free IO handles that's currently not nicely reported (because
> it's unclear how exactly to do so, see comment above pgaio_io_acquire_nb()).

effective_io_concurrency is 100. io_max_concurrency is 64. Nothing out of
the ordinary there.

> iostat -xmy /path/to/block/device
>
> I'd like to see the difference in average IO size (rareq-sz), queue depth
> (aqu-sz) and completion time (r_await) between the fast and slow cases.

I'll show one second interval output.

Slow, sequential backwards scan query
-------------------------------------

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 24613.00 192.29 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.92 53.20

avg-cpu: %user %nice %system %iowait %steal %idle
0.22 0.00 0.44 0.85 0.00 98.50

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 25320.00 197.81 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.18 51.20

Fast, random backwards scan query
---------------------------------

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 27140.59 212.04 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.50 23.37

avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.84 0.00 0.00 98.66

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 50401.00 393.76 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10.06 41.60

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2025-08-14 20:40:25 Re: POC: Parallel processing of indexes in autovacuum
Previous Message Tom Lane 2025-08-14 19:45:41 Re: [PATCH] bms_prev_member() can read beyond the end of the array of allocated words