Re: Streamify more code paths

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: Streamify more code paths
Date: 2026-03-12 15:35:48
Message-ID: CABPTF7UA3sEw1ZpAj8qAKY6Xs71sk41X-pV43_iZHZz2U_AP=Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 12, 2026 at 12:39 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > > Thanks for doing that. On my side, I am going to look at the gin and
> > > hash vacuum paths first with more testing as these don't use a custom
> > > callback. I don't think that I am going to need a lot of convincing,
> > > but I'd rather produce some numbers myself before doing something.
> > > I'll tweak a mounting point with the delay trick, as well.
> >
> > While debug_io_direct has been helping a bit, the trick for the delay
> > to throttle the IO activity has helped much more with my runtime
> > numbers. I have mounted a separate partition with a delay of 5ms,
> > disabled checksums (this part did not make a real difference), and
> > evicted shared buffers for relation and indexes before the VACUUM.
> >
> > Then I got better numbers. Here is an extract:
> > - worker=3:
> > gin_vacuum (100k tuples) base= 1448.2ms patch= 572.5ms 2.53x
> > ( 60.5%) (reads=175→104, io_time=1382.70→506.64ms)
> > gin_vacuum (300k tuples) base= 3728.0ms patch= 1332.0ms 2.80x
> > ( 64.3%) (reads=486→293, io_time=3669.89→1266.27ms)
> > bloom_vacuum (100k tuples) base= 21826.8ms patch= 17220.3ms 1.27x
> > ( 21.1%) (reads=485→117, io_time=4773.33→270.56ms)
> > bloom_vacuum (300k tuples) base= 67054.0ms patch= 53164.7ms 1.26x
> > ( 20.7%) (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> > - io_uring:
> > gin_vacuum (100k tuples) base= 1240.3ms patch= 360.5ms 3.44x
> > ( 70.9%) (reads=175→104, io_time=1175.35→299.75ms)
> > gin_vacuum (300k tuples) base= 2829.9ms patch= 642.0ms 4.41x
> > ( 77.3%) (reads=465.5→293, io_time=2768.46→579.04ms)
> > bloom_vacuum (100k tuples) base= 22121.7ms patch= 17532.3ms 1.26x
> > ( 20.7%) (reads=485→117, io_time=4850.46→285.28ms)
> > bloom_vacuum (300k tuples) base= 67058.0ms patch= 53118.0ms 1.26x
> > ( 20.8%) (reads=1431.5→327.5, io_time=13870.9→305.44ms)
> >
> > The higher the number of tuples, the better the performance for each
> > individual operation, but the tests take a much longer time (tens of
> > seconds vs tens of minutes). For GIN, the numbers can be quite good
> > once these reads are pushed. For bloom, the runtime is improved, and
> > the IO numbers are much better.
> >
>
> -- io_uring, medium size
>
> bloom_vacuum_medium base= 8355.2ms patch= 715.0ms 11.68x
> ( 91.4%) (reads=4732→1056, io_time=7699.47→86.52ms)
> pgstattuple_medium base= 4012.8ms patch= 213.7ms 18.78x
> ( 94.7%) (reads=2006→2006, io_time=4001.66→200.24ms)
> pgstatindex_medium base= 5490.6ms patch= 37.9ms 144.88x
> ( 99.3%) (reads=2745→173, io_time=5481.54→7.82ms)
> hash_vacuum_medium base= 34483.4ms patch= 2703.5ms 12.75x
> ( 92.2%) (reads=19166→3901, io_time=31948.33→308.05ms)
> wal_logging_medium base= 7778.6ms patch= 7814.5ms 1.00x
> ( -0.5%) (reads=2857→2845, io_time=11.84→11.45ms)
>
> -- worker, medium size
> bloom_vacuum_medium base= 8376.2ms patch= 747.7ms 11.20x
> ( 91.1%) (reads=4732→1056, io_time=7688.91→65.49ms)
> pgstattuple_medium base= 4012.7ms patch= 339.0ms 11.84x
> ( 91.6%) (reads=2006→2006, io_time=4002.23→49.99ms)
> pgstatindex_medium base= 5490.3ms patch= 38.3ms 143.23x
> ( 99.3%) (reads=2745→173, io_time=5480.60→16.24ms)
> hash_vacuum_medium base= 34638.4ms patch= 2940.2ms 11.78x
> ( 91.5%) (reads=19166→3901, io_time=31881.61→242.01ms)
> wal_logging_medium base= 7440.1ms patch= 7434.0ms 1.00x
> ( 0.1%) (reads=2861→2825, io_time=10.62→10.71ms)
>

The io_time metric used so far measures only read time and ignores
write I/O, which can be misleading for write-heavy paths like VACUUM.
It is now split into read_time and write_time.
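
For context, such a split can be derived from two snapshots of
pg_stat_io (PostgreSQL 16+, with track_io_timing enabled, timings in
ms). A minimal sketch; io_delta() and the dict shape are hypothetical
stand-ins for however the benchmark script actually collects the view,
and the sample numbers below are only illustrative:

```python
# Hypothetical sketch: derive reads/read_time/writes/write_time deltas
# from two pg_stat_io-style snapshots taken before and after a run.
# Each snapshot maps a (backend_type, context) key to cumulative counters.

def io_delta(before, after):
    """Return per-key counter deltas between two snapshots."""
    fields = ("reads", "read_time", "writes", "write_time")
    return {
        key: {f: after[key][f] - before[key][f] for f in fields}
        for key in after
        if key in before
    }

before = {("client backend", "vacuum"):
          {"reads": 100, "read_time": 50.0, "writes": 10, "write_time": 5.0}}
after = {("client backend", "vacuum"):
         {"reads": 2438, "read_time": 4186.19,
          "writes": 6228, "write_time": 12318.81}}

delta = io_delta(before, after)[("client backend", "vacuum")]
# delta["read_time"] is 4136.19 and delta["write_time"] is 12313.81,
# i.e. the read and write portions are reported separately.
```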

-- write-delay 2 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3
./run_streaming_benchmark.sh --baseline --io-method worker
--io-workers 12 --test hash_vacuum --direct-io --read-delay 2
--write-delay 2
v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small base= 16652.8ms patch= 13493.2ms 1.23x
( 19.0%) (reads=2338→815, read_time=4136.19→884.79ms,
writes=6218→6206, write_time=12313.81→12289.58ms)

-- write-delay 0 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3
./run_streaming_benchmark.sh --baseline --io-method worker
--io-workers 12 --test hash_vacuum --direct-io --read-delay 2
--write-delay 0
v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small base= 4310.2ms patch= 1146.7ms 3.76x
( 73.4%) (reads=2338→815, read_time=4002.24→833.47ms,
writes=6218→6206, write_time=186.69→140.96ms)
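
For anyone reproducing these summaries, the speedup and percent
columns follow directly from the base and patch runtimes. A small
illustrative helper (the function name and exact formatting are mine,
not taken from the benchmark script):

```python
# Illustrative sketch of how the summary columns above can be derived:
# speedup = base / patch, and the percentage is the runtime saved.

def summarize(name, base_ms, patch_ms):
    speedup = base_ms / patch_ms
    saved_pct = (1 - patch_ms / base_ms) * 100
    return (f"{name} base={base_ms:9.1f}ms patch={patch_ms:9.1f}ms "
            f"{speedup:5.2f}x ({saved_pct:5.1f}%)")

print(summarize("hash_vacuum_small", 4310.2, 1146.7))
# hash_vacuum_small base=   4310.2ms patch=   1146.7ms  3.76x ( 73.4%)
```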

--
Best,
Xuneng

Attachment Content-Type Size
run_streaming_benchmark.sh text/x-sh 34.5 KB
