Re: Relation bulk write facility

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Relation bulk write facility
Date: 2026-07-01 14:18:12
Message-ID: 789656F9-C573-474B-8C44-56FCCC16A8D3@yandex-team.ru
Lists: pgsql-hackers

Hi Andres,

On 2023-11-19, Andres Freund wrote:
> One thing I'd like to use the centralized handling for is to track
> such writes in pg_stat_io. I don't mean as part of the initial patch,
> just that that's another reason I like the facility.

Coming back to this idea. Today the writes that build new relation
storage through bulk_write.c - a CREATE INDEX, a CLUSTER / VACUUM FULL
heap rewrite, ALTER TABLE SET TABLESPACE - go straight to smgrextend() /
smgrwrite() and never show up in pg_stat_io. So a plain CREATE INDEX on
a large table can write gigabytes that are completely invisible there.
The nearest trace is the fsync, and even that is partial. In the common
case the sync is deferred, and the checkpointer's fsync shows up under
the normal context (attributed to the checkpointer, not to the backend
that did the build). If a checkpoint runs concurrently, the backend
does smgrimmedsync() instead, which is not counted anywhere at all.

This was foreseen while pg_stat_io was being built: the io_object column
was added to "pave the way for bypass IO" [0], and counting such writes
was deliberately deferred to a future central point (the smgr
wrappers discussed back then, later implemented as Heikki's bulk
write facility). bulk_write.c is now exactly that point.

PFA a patch that does what was planned: it accounts for those writes
and extends from the one central place, smgr_bulk_flush(), in a new
IOCONTEXT_BYPASS context. The fsync path is left untouched (more on that
in point 3 below).
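
To make the shape of the accounting concrete, here is a minimal,
self-contained sketch of the idea. It is not the code in the attached
patch: every Sketch* / sketch_* name is hypothetical, the enums are
stand-ins rather than PostgreSQL's real IOContext / IOOp, and the rule
"a block at or beyond the current relation size counts as an extend,
anything below it as a write" is my assumption about how a single flush
routine would tell the two apart.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not PostgreSQL's real IOContext / IOOp enums. */
typedef enum { CTX_NORMAL, CTX_BYPASS, CTX_COUNT } SketchContext;
typedef enum { OP_EXTEND, OP_WRITE, OP_COUNT } SketchOp;

/* One counter per (context, operation) pair. */
typedef struct
{
    uint64_t counts[CTX_COUNT][OP_COUNT];
} SketchIoStats;

/* A page queued for writing, identified only by its block number. */
typedef struct
{
    uint32_t blkno;
} SketchPendingWrite;

static void
sketch_count_io(SketchIoStats *stats, SketchContext ctx, SketchOp op)
{
    assert(ctx < CTX_COUNT && op < OP_COUNT);
    stats->counts[ctx][op]++;
}

/*
 * Flush the pending pages and account for each one in the bypass context.
 * Assumption: a block at or beyond the current relation size is an extend;
 * a block below it is an overwrite of an existing block.
 * Returns the relation size (in blocks) after the flush.
 */
static uint32_t
sketch_bulk_flush(SketchIoStats *stats, const SketchPendingWrite *pending,
                  int npending, uint32_t nblocks_on_disk)
{
    for (int i = 0; i < npending; i++)
    {
        if (pending[i].blkno >= nblocks_on_disk)
        {
            /* real code would call smgrextend() here */
            sketch_count_io(stats, CTX_BYPASS, OP_EXTEND);
            nblocks_on_disk = pending[i].blkno + 1;
        }
        else
        {
            /* real code would call smgrwrite() here */
            sketch_count_io(stats, CTX_BYPASS, OP_WRITE);
        }
    }
    return nblocks_on_disk;
}
```

Because all callers funnel through the one flush routine, nothing in
the normal context changes and no individual caller (index build, heap
rewrite, tablespace move) needs its own counting code.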

A few points open for discussion:

1. The context name. I went with "bypass", matching the language already
used in the docs ("bypasses shared buffers") and in this thread. It
also keeps it clearly apart from the strategy-ring "bulkwrite"
(BAS_BULKWRITE), which is a different thing. I am not attached to the
name, so other suggestions are welcome.

2. Scope. This patch tracks the write side only. The read side that
also bypasses shared buffers (e.g. the smgrread() in
RelationCopyStorage for a tablespace move) stays untracked; I left it
out to keep the first cut small, but it could be added the same way.

3. The fsync. The deferred sync is already visible (the checkpointer
counts it under normal); the immediate one - smgrimmedsync(), taken
when a checkpoint intervenes during the build - is counted nowhere.
md.c actually anticipates exactly this: a comment in mdimmedsync()
says such backend fsyncs should be tracked in a separate IOContext
from the checkpointer's, but that it was waiting until other IO that
bypasses the buffer manager is tracked too. This patch is that
prerequisite, so counting smgrimmedsync() under the new context is
the obvious next step; I left it out of this first cut to keep it
focused.

Temporary file I/O (sort/tuplestore spills, hash agg/join batches) is a
separate and larger story: it is non-block-oriented and goes through
buffile.c, not smgr, so it is out of scope for this patch. That was anticipated
when pg_stat_io was designed (the columns were left unprefixed to allow
non-block-oriented I/O later [1]); it would be a follow-up of its own.

Timing is counted as in the buffered paths (pgstat_prepare_io_time /
pgstat_count_io_op_time), so the *_time columns work when
track_io_timing is on.
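
The prepare/count pairing can be illustrated with a small standalone
sketch. Again, this is not the patch and not the real pgstat functions:
the sketch_* names, the struct, and the injected fake clock are all
hypothetical, chosen only to show that the operation is always counted
while elapsed time is accumulated only when timing is enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clock callback returning microseconds. */
typedef uint64_t (*SketchClockFn) (void);

typedef struct
{
    bool          track_timing;   /* stand-in for track_io_timing */
    SketchClockFn now_us;         /* injected so the sketch is deterministic */
    uint64_t      writes;         /* number of write operations */
    uint64_t      write_time_us;  /* accumulated write time */
} SketchTimedStats;

/* Deterministic fake clock for demonstration: advances 5 us per call. */
static uint64_t
sketch_fake_clock_us(void)
{
    static uint64_t t = 0;

    t += 5;
    return t;
}

/* Take a start timestamp only when timing is enabled. */
static uint64_t
sketch_prepare_io_time(const SketchTimedStats *st)
{
    return st->track_timing ? st->now_us() : 0;
}

/* Always count the operation; add elapsed time only when timing is enabled. */
static void
sketch_count_io_op_time(SketchTimedStats *st, uint64_t start_us)
{
    st->writes++;
    if (st->track_timing)
        st->write_time_us += st->now_us() - start_us;
}
```

A caller would take the start time just before the write and report it
just after, so with timing off the clock is never read at all.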

WDYT?

Thank you!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/CAOtHd0ApHna7_p6mvHoO%2BgLZdxjaQPRemg3_o0a4ytCPijLytQ%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CAAKRu_ZiLuEPANqsHqqRPbgt4BTmgMqtPpyJJaTQxLs818tvKg%40mail.gmail.com

Attachment: 0001-Track-relation-writes-that-bypass-shared-buffers-in-.patch (application/octet-stream, 13.9 KB)