| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Streamify more code paths |
| Date: | 2026-02-09 10:40:59 |
| Message-ID: | CABPTF7VtSYmC5LZSnkJWYn9PCkxgOJd9QbtAM79qftBK-fbA4w@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for looking into this.
> > >
> > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
> > > > >
> > > > > Hi,
> > > > > >
> > > > > > Two more to go:
> > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > >
> > > > > > Benchmarks will be conducted soon.
> > > > > >
> > > > >
> > > > > v6 in the last message had a problem and was not up to date. Attaching
> > > > > the right one again. Sorry for the noise.
> > > >
> > > > 0003 and 0006:
> > > >
> > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > >
> > > Done.
> > >
> > > > 0005:
> > > >
> > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > nbufs = 0;
> > > > while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > {
> > > > - Buffer buf = ReadBufferExtended(rel, forknum, blkno,
> > > > - RBM_NORMAL, NULL);
> > > > + Buffer buf = read_stream_next_buffer(stream, NULL);
> > > > +
> > > > + if (!BufferIsValid(buf))
> > > > + break;
> > > >
> > > > We are loosening a check here; there should not be an invalid buffer in
> > > > the stream until the endblk. I think you can remove this
> > > > BufferIsValid() check, then we can learn if something goes wrong.
> > >
> > > My earlier reason for not adding an assert at the end of streaming was
> > > the potential early break here:
> > >
> > > /* Nothing more to do if all remaining blocks were empty. */
> > > if (nbufs == 0)
> > > break;
> > >
> > > After looking more closely, it turns out this was a misunderstanding of
> > > the logic on my part.
> > >
> > > > 0006:
> > > >
> > > > You can use read_stream_reset() instead of read_stream_end(), then you
> > > > can use the same stream with different variables, I believe this is
> > > > the preferred way.
> > > >
> > > > Rest LGTM!
> > > >
> > >
> > > Yeah, reset seems the more appropriate way here.
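
For the archives, the reuse pattern suggested above looks roughly like this. This is an illustrative sketch against the read stream API (PostgreSQL 17+), not the actual patch: `next_block_cb`, `priv`, and the per-bucket bookkeeping are hypothetical names, and the fragment only compiles inside the server source tree.

```c
/*
 * Illustrative sketch only: next_block_cb, priv, bucket_start() and
 * bucket_end() are hypothetical, not the patch's actual names.
 */
ReadStream *stream;
Buffer      buf;

stream = read_stream_begin_relation(READ_STREAM_FULL, NULL,
                                    rel, MAIN_FORKNUM,
                                    next_block_cb, &priv, 0);

for (bucket = 0; bucket <= max_bucket; bucket++)
{
    /* Point the callback's private state at this bucket's pages. */
    priv.next_blkno = bucket_start(bucket);
    priv.end_blkno = bucket_end(bucket);

    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... vacuum the primary bucket page ... */
        ReleaseBuffer(buf);
    }

    /*
     * Rewind the stream so it can be reused for the next bucket,
     * rather than calling read_stream_end() and beginning a new
     * stream each time.
     */
    read_stream_reset(stream);
}

read_stream_end(stream);
```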
> > >
> >
> > Ran pgindent using the updated typedefs.list.
> >
>
> I've completed benchmarking of the v4 streaming read patches across
> three I/O methods (io_uring, sync, worker). Tests were run with cold
> cache on large datasets.
>
> --- Settings ---
>
> shared_buffers = '8GB'
> effective_io_concurrency = 200
> io_method = $IO_METHOD
> io_workers = $IO_WORKERS
> io_max_concurrency = $IO_MAX_CONCURRENCY
> track_io_timing = on
> autovacuum = off
> checkpoint_timeout = 1h
> max_wal_size = 10GB
> max_parallel_workers_per_gather = 0
>
> --- Machine ---
> CPU: 48-core
> RAM: 256 GB DDR5
> Disk: 2 x 1.92 TB NVMe SSD
>
> --- Executive Summary ---
>
> The patches provide significant benefits for I/O-bound sequential
> operations, with the greatest improvements seen when using
> asynchronous I/O methods (io_uring and worker). The synchronous I/O
> mode shows reduced but still meaningful gains.
>
> --- Results by I/O Method ---
>
> Best Results: io_method=worker
>
> bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
>
> io_method=io_uring
>
> bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> wal_logging: 1.00x (-0.5%, neutral); no change in reads
>
> io_method=sync (baseline comparison)
>
> bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> wal_logging: 0.99x (-0.7%, neutral); no change in reads
>
> --- Observations ---
>
> Async I/O amplifies the streaming benefits: the same patch shows a 3-4x
> improvement for bloom_scan with worker/io_uring vs. 1.2x with sync.
>
> I/O operation reduction is consistent: All modes show the same ~93-94%
> reduction in I/O operations for bloom_scan and pgstattuple.
>
> VACUUM operations show modest gains: despite large I/O reductions
> (76-80%), wall-clock improvements are much smaller (roughly 1-6%),
> since VACUUM carries larger CPU overhead (tuple processing, index
> maintenance, WAL logging).
>
> log_newpage_range shows no benefit: the patch is performance-neutral
> (0.98-1.00x across I/O methods).
>
> --
> Best,
> Xuneng
There was an issue with the wal_logging test in the original benchmark script.
--- The original benchmark ---
The original benchmark used:
ALTER TABLE ... SET LOGGED
This path performs a full table rewrite via ATRewriteTable()
(tablecmds.c): it creates a new relfilenode and copies the tuples into
it. It does not call log_newpage_range() on the rewritten pages.
log_newpage_range() can only be reached indirectly, through the
pending-sync logic in storage.c, and only when:
- wal_level = minimal, and
- relation size < wal_skip_threshold (default 2MB).
Our test tables (1M–20M rows) are far larger than 2MB, so PostgreSQL
fsyncs the file instead of WAL-logging it. The previous benchmark
therefore measured table-rewrite I/O, not the log_newpage_range() path.
--- Current design: GIN index build ---
The benchmark now uses:
CREATE INDEX ... USING gin (doc_tsv)
This reliably exercises log_newpage_range() because:
- ginbuild() constructs the index and WAL-logs all new index pages
using log_newpage_range().
- This is part of the normal GIN build path, independent of wal_skip_threshold.
- The streaming-read patch modifies the WAL logging path inside
log_newpage_range(), which this test directly targets.
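For the archives, a minimal version of what the new test does looks like the following. The table name, row count, and document text are illustrative; the attached script is the authoritative version.

```sql
-- Illustrative sketch of the wal_logging_large test; names and sizes
-- here are made up, see the attached script for the real setup.
CREATE TABLE docs (id bigint, doc_tsv tsvector);

INSERT INTO docs
SELECT g, to_tsvector('english', 'sample document number ' || g)
FROM generate_series(1, 1000000) AS g;

-- ginbuild() WAL-logs all new index pages via log_newpage_range(),
-- regardless of wal_level or wal_skip_threshold.
CREATE INDEX docs_gin_idx ON docs USING gin (doc_tsv);
```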
--- Results (wal_logging_large) ---
worker: 1.00x (+0.5%); no meaningful change in reads
io_uring: 1.01x (+1.3%); no meaningful change in reads
sync: 1.01x (+1.1%); no meaningful change in reads
--
Best,
Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| run_streaming_benchmark.sh | text/x-sh | 27.2 KB |