Re: Use streaming read I/O in BRIN vacuuming

From: Arseniy Mukhin <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use streaming read I/O in BRIN vacuuming
Date: 2025-10-13 14:54:51
Message-ID: CAE7r3MLCU68tKC0-3d1iKV-3mYpDy5O7ireaFQsAOsyJEFtPXQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Thu, Oct 9, 2025 at 2:03 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sun, Aug 31, 2025 at 12:48 PM Arseniy Mukhin
> <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com> wrote:
> >
> > Hi!
> >
> > On Sun, Aug 31, 2025 at 8:49 PM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> > >
> > > Hi!
> > >
> > > > On 31 Aug 2025, at 21:17, Arseniy Mukhin <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com> wrote:
> > > >
> > > > PFA the patch that migrates BRIN vacuum to the read stream API.
> > >
> > > The patch is nice and straightforward. Looks good to me.
> > >
> >
> > Thank you for the review!
> >
> > > Some notes that do not seem to me problem of this patch:
> > > 1. This comment is copied 7 times already across codebase.
> > > "It is safe to use batchmode as block_range_read_stream_cb"
> > > Maybe we can refactor comments and function names...
> >
> > Yes, I had similar thoughts, but having these comments at callsites
> > has its own benefits, there is a thread about these comments [0]...
> >
> > > 2. Somehow brin_vacuum_scan() avoid dance of getting RelationGetNumberOfBlocks() many times to be entirely sure everything is scanned. Unlike other index vacuums, see btvacuumscan() for example.
> >
> > If I understand correctly, in other access methods you need to be sure
> > that you read the relation up to the end, so you don't leave any index
> > tuples that should be pruned. BRIN doesn't have a prune phase, there
> > is only a cleanup phase. So it seems it's not a big deal if you miss
> > several pages that were allocated during the vacuum.
> >
>
> Thank you for proposing the patch! I've reviewed the patch and have
> some comments:

Thank you for the review!

>
> + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
> + READ_STREAM_FULL |
> + READ_STREAM_SEQUENTIAL |
> + READ_STREAM_USE_BATCHING,
> + strategy,
> + idxrel,
> + MAIN_FORKNUM,
> + block_range_read_stream_cb,
> + &p,
> + 0);
>
> Unlike other index AM's it uses READ_STREAM_SEQUENTIAL. If there are
> some reasons to use it, we should leave comments there.

Good point, thank you. I looked again at the usage of the
READ_STREAM_SEQUENTIAL flag, and I'm leaning toward not using it here.
But I'm not completely sure, so I decided to ask about the flag usage
in the thread [0].

>
> ---
> Have you done any performance testing with this patch? Since BRIN
> indexes typically don't grow very large, I'm curious how much benefit
> the read_stream provides for BRIN index cleanup.

Yeah, BRIN is known as a compact index, so these changes should have
less impact, but I thought it should still be useful, just on a
smaller scale. I didn't run any measurements, as the changes are
similar to those made to other index AMs, where this was considered an
improvement. But after your question, I decided to try running some
benchmarks.

------------------------------------------
Performance testing
------------------------------------------

Setup & data (fill_data.sql)

shared_buffers = 1Gb
Table size: ~970Mb
Index size: ~30Mb (BRIN index artificially was made quite big, but it
lets us see patch impact better).

Test cases

1) Cold data - restart db, drop page cache, table seqscan, run vacuum.
This way we read index data from the disk (restart_with_cache_drop.sh)
2) Page cache - restart db, table seqscan, run vacuum. This way we
read index data from the page cache (restart_with_wo_cache_drop.sh)
3) Shared buffers - run vacuum with data in shared buffers (use query
with bitmap scan to warm cache) (vacuum_cached.sh)

Every io_method was used with every test case. I also tried to remove
READ_STREAM_SEQUENTIAL and measured it too (it's called
patched_wo_seq).

What were measured:

1) Duration of brin_vacuum_scan (without FreeSpaceMapVacuum(idxrel))
2) Avg read rate from vacuum report. Index has 4260 pages, and the
vacuum report says 4300 pages were read, so it looks like almost all
read rate is from brin_vacuum_scan().

Results

--- Cold data ---
Method | Version | Duration (ms) | Throughput (MB/s)
----------------------------------------------------------------------
io_uring | master | 14.6 ± 0.8 | 2068.1 ± 127.6
| patched | 9.9 ± 1.3 | 2947.3 ± 363.0
| patched_wo_seq | 10.5 ± 1.5 | 2817.5 ± 359.7
sync | master | 16.1 ± 1.9 | 1921.6 ± 211.6
| patched | 15.4 ± 1.6 | 1985.9 ± 182.3
| patched_wo_seq | 13.9 ± 1.2 | 2155.7 ± 180.4
worker | master | 16.4 ± 1.6 | 1850.0 ± 168.6
| patched | 9.2 ± 0.5 | 3077.3 ± 154.3
| patched_wo_seq | 9.0 ± 0.5 | 3133.3 ± 158.1

--- Page cache ---

Method | Version | Duration (ms) | Throughput (MB/s)
----------------------------------------------------------------------
io_uring | master | 10.4 ± 1.7 | 2971.3 ± 469.6
| patched | 7.3 ± 1.7 | 4152.7 ± 910.6
| patched_wo_seq | 7.0 ± 1.4 | 4208.0 ± 742.2
sync | master | 10.4 ± 2.3 | 3037.9 ± 655.6
| patched | 9.2 ± 1.5 | 3319.8 ± 646.7
| patched_wo_seq | 9.1 ± 1.6 | 3384.7 ± 674.9
worker | master | 10.2 ± 1.7 | 3013.4 ± 490.0
| patched | 3.0 ± 0.2 | 7878.6 ± 540.3
| patched_wo_seq | 3.0 ± 0.3 | 8127.3 ± 1093.7

--- Shared buffers ---

Method | Version | Duration (ms) | Throughput (MB/s)
----------------------------------------------------------------------
io_uring | master | 3.1 ± 0.3 | 0.0 ± 0.0
| patched | 3.2 ± 0.5 | 0.0 ± 0.0
| patched_wo_seq | 3.3 ± 0.4 | 0.0 ± 0.0
sync | master | 3.1 ± 0.4 | 0.0 ± 0.0
| patched | 3.3 ± 0.5 | 0.0 ± 0.0
| patched_wo_seq | 3.1 ± 0.5 | 0.0 ± 0.0
worker | master | 3.2 ± 0.4 | 0.0 ± 0.0
| patched | 3.4 ± 0.3 | 0.0 ± 0.0
| patched_wo_seq | 3.6 ± 0.3 | 0.0 ± 0.0

Looks like we have a win for 'cold data' and 'page cache' cases with
worker and io_uring, and some modest improvement for sync.
READ_STREAM_SEQUENTIAL doesn't seem to change anything. I don't have
much benchmarking experience, so hope I didn't make any bad
mistakes.I'm new to benchmarking, so I hope I haven't made any serious
mistakes.

Thank you!

[0] https://www.postgresql.org/message-id/CAE7r3MJ5NHb1BMo1oCNWmfrG%3DYtx-GKX44YNdA21FuQQQeq_Qg%40mail.gmail.com

Best regards,
Arseniy Mukhin

Attachment Content-Type Size
restart_with_wo_cache_drop.sh application/x-shellscript 474 bytes
restart_with_cache_drop.sh application/x-shellscript 470 bytes
vacuum_cached.sh application/x-shellscript 310 bytes
fill_data.sql application/sql 577 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2025-10-13 14:57:37 Re: Build failure with Meson >= 1.8.3 on Windows
Previous Message jian he 2025-10-13 14:22:24 Re: add function argument name to substring and substr