Re: Automatically sizing the IO worker pool

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Automatically sizing the IO worker pool
Date: 2025-08-04 05:30:29
Message-ID: CA+hUKG+P80oR7yy5-67uHqwBWr9rux69BkQF9fwz=f0LuDn3Rw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Thanks. I was experimenting with this approach, and realized there isn't
> much metrics exposed about workers and the IO queue so far. Since the

Hmm. You can almost infer the depth from the pg_aios view. All IOs
in use are visible there, and the SUBMITTED ones are either in the
queue, currently being executed by a worker, or being executed
synchronously by a regular backend that fell back to synchronous
execution because the queue was full. Perhaps we just need to be
able to distinguish those three cases in that view.
For the synchronous-in-submitter overflow case, I think f_sync should
really show 't', and I'll post a patch for that shortly. For
"currently executing in a worker", I wonder if we could have an "info"
column that queries a new optional callback
pgaio_iomethod_ops->get_info(ioh) where worker mode could return
"worker 3", or something like that.

> worker pool growth is based on the queue size and workers try to share
> the load uniformly, it makes to have a system view to show those

Actually it's not uniform: it tries to wake up the lowest numbered
worker that advertises itself as idle in that little bitmap of idle
workers. So if you look in htop you'll see that worker 0 is the
busiest, then worker 1, etc. Only if they are all quite busy does it
become almost uniform, which probably implies you've hit
io_max_workers and should set it higher (or, without this patch, that
you should just increase io_workers manually, assuming your I/O
hardware can take more).
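
For illustration, the selection boils down to something like this (a
toy sketch with invented names, not the actual method_worker.c code,
which also has to update the shared bitmap and set that worker's
latch):

#include <stdint.h>
#include <stdio.h>

/*
 * Toy sketch of "wake the lowest numbered idle worker": one bit per idle
 * worker, pick the lowest set bit.
 */
static int
choose_worker_to_wake(uint32_t idle_mask)
{
    if (idle_mask == 0)
        return -1;              /* nobody idle: the IO just waits in the queue */
    return __builtin_ctz(idle_mask);    /* GCC/Clang: index of lowest set bit */
}

int
main(void)
{
    /* workers 2, 3 and 5 are idle, so worker 2 gets the wakeup */
    printf("%d\n", choose_worker_to_wake((1u << 2) | (1u << 3) | (1u << 5)));
    return 0;
}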

Originally I made it like that to give higher numbered workers a
chance to time out (anticipating this patch). Later I found another
reason to do it that way:

When I tried uniform distribution using atomic_fetch_add(&distributor,
1) % nworkers to select the worker to wake up, avg(latency) and
stddev(latency) were both higher for simple tests like the one
attached to the first message, when running several copies of it
concurrently. The concentrate-into-lowest-numbers design benefits
from latch collapsing and allows the busier workers to avoid going
back to sleep when they could immediately pick up a new job. I didn't
change that in this patch, though I did tweak the "fan out" logic a
bit after experimenting on several machines: the code in master/18 is
a bit over-enthusiastic about fanning out and has a higher spurious
wakeup ratio (something this patch actually measures and tries to
reduce).
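
For contrast, the rejected uniform scheme was essentially the
following (again just a sketch; the names are made up):

#include <stdatomic.h>
#include <stdio.h>

/*
 * Sketch of the rejected round-robin distribution: every submission wakes
 * the next worker in turn, regardless of which workers are actually idle.
 */
static _Atomic unsigned distributor;

static int
choose_worker_round_robin(int nworkers)
{
    return atomic_fetch_add(&distributor, 1) % nworkers;
}

int
main(void)
{
    for (int i = 0; i < 6; i++)
        printf("%d ", choose_worker_round_robin(3));    /* 0 1 2 0 1 2 */
    putchar('\n');
    return 0;
}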

Here is one of my less successful attempts at a round-robin system
that tries to adjust the pool size with more engineering. It was
consistently worse on those latency statistics than this approach, and
wasn't even as good at finding a good pool size, so eventually I
realised it was a dead end and my original work-concentrating concept
was better:

https://github.com/macdice/postgres/tree/io-worker-pool

FWIW the patch in this thread is in this public branch:

https://github.com/macdice/postgres/tree/io-worker-pool-3

> Regarding the worker pool growth approach, it sounds reasonable to me.

Great to hear. I wonder what other kinds of testing we should do to
validate this, but I'm feeling quite confident about this patch and
think it should probably go in sooner rather than later.

> With static number of workers one needs to somehow find a number
> suitable for all types of workload, where with this patch one needs only
> to fiddle with the launch interval to handle possible spikes. It would
> be interesting to investigate, how this approach would react to
> different dynamics of the queue size. I've plotted one "spike" scenario
> in the "Worker pool size response to queue depth", where there is a
> pretty artificial burst of IO, making the queue size look like a step
> function. If I understand the patch implementation correctly, it would
> respond linearly over time (green line), one could also think about
> applying a first order butterworth low pass filter to respond quicker
> but still smooth (orange line).

Interesting.

There is only one kind of smoothing in the patch currently, relating
to the pool size going down. It models spurious latch wakeups in an
exponentially decaying ratio of wakeups:work. That's the only way I
could find to deal with the inherent sloppiness of the wakeup
mechanism with a shared queue: when you wake the lowest numbered idle
worker as of some moment in time, it might lose the race against an
even lower numbered worker that finishes its current job and steals
the new job. When workers steal jobs, latency decreases, which is
good, so instead of preventing it I eventually figured out that we
should measure it, smooth it, and use it to limit wakeup propagation.
I wonder if that naturally produces curves a bit like your Butterworth
line when the pool size is already going down, but I'm not sure.
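
Very roughly, and with made-up names and constants rather than what
the patch actually does, the smoothing is in the spirit of:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch only: keep an exponentially decaying count of wakeups and of
 * wakeups that actually found work, as a rough model of the kind of
 * wakeups:work smoothing described above.  Constants are invented.
 */
#define DECAY 0.9

static double wakeups;          /* decayed count of wakeups */
static double useful;           /* decayed count of wakeups that found work */

static void
record_wakeup(bool found_work)
{
    wakeups = wakeups * DECAY + 1.0;
    useful = useful * DECAY + (found_work ? 1.0 : 0.0);
}

static double
spurious_ratio(void)
{
    return wakeups > 0.0 ? 1.0 - useful / wakeups : 0.0;
}

int
main(void)
{
    record_wakeup(true);
    record_wakeup(false);       /* spurious: the job was stolen by a busier worker */
    record_wakeup(true);
    printf("spurious ratio ~ %.2f\n", spurious_ratio());
    return 0;
}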

As for the curve on the way up, hmm, I'm not sure. Yes, it goes up
linearly and is limited by the launch delay, but I was thinking of
that only as the way it grows when the *variation* in workload changes
over a long time frame. In other words, maybe it's not so important
how exactly it grows; what matters more is that it achieves a steady
state that can handle the oscillations and spikes in your workload.
The idle timeout creates that steady state by holding the current pool
size for quite a while, so that it can handle your quieter and busier
moments immediately without having to adjust the pool size.

In that other failed attempt I tried to model that more explicitly,
with "active" workers and "spare" workers: the active set sized for
average demand with uniform wakeups, and the spare set sized for some
number of standard deviations and woken up only when the queue is
high. But I could never really make it work well...

> But in reality the queue size would be of course much more volatile even
> on stable workloads, like in "Queue depth over time" (one can see
> general oscillation, as well as different modes, e.g. where data is in
> the page cache vs where it isn't). Event more, there is a feedback where
> increasing number of workers would accelerate queue size decrease --
> based on [1] the system utilization for M/M/k depends on the arrival
> rate, processing rate and number of processors, where pretty intuitively
> more processors reduce utilization. But alas, as you've mentioned this
> result exists for Poisson distribution only.

> Btw, I assume something similar could be done to other methods as well?
> I'm not up to date on io uring, can one change the ring depth on the
> fly?

Each backend's io_uring submission queue is configured at startup and
can't be changed later, but it is sized for the maximum number of IOs
each backend can submit, io_max_concurrency, which corresponds to the
backend's portion of the fixed array of PgAioHandle objects. I
suppose you could say that each backend's submission queue can't
overflow at that level, because it's perfectly sized and not shared
with other backends; or, to put it another way, the equivalent of
overflow is that we never try to submit more IOs than that.
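
As a minimal illustration of that (using liburing directly rather than
the actual method_io_uring.c code, and with a placeholder constant
standing in for io_max_concurrency):

#include <liburing.h>
#include <stdio.h>
#include <string.h>

#define IO_MAX_CONCURRENCY 64   /* placeholder for the GUC's value */

int
main(void)
{
    struct io_uring ring;
    int         rc;

    /*
     * The submission queue depth is fixed at ring creation time.  If it is
     * sized to the most IOs this backend can have in flight, the backend
     * can never try to queue more of its own IOs than fit.
     */
    rc = io_uring_queue_init(IO_MAX_CONCURRENCY, &ring, 0);
    if (rc < 0)
    {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-rc));
        return 1;
    }
    io_uring_queue_exit(&ring);
    return 0;
}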

Worker mode has a shared submission queue, but falls back to
synchronous execution if it's full, which is a bit weird as it lets
your IOs jump the queue in a sense. That is a good reason to want
this patch: the pool can try to find a size that avoids the fallback,
instead of leaving the user in the dark.
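
That behaviour, in miniature (purely a toy model of the overflow path,
not method_worker.c code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: a bounded shared queue, and a submitter that performs the IO
 * itself when the queue is full.
 */
#define QUEUE_SIZE 4

static int  queue[QUEUE_SIZE];
static int  queued;

static bool
queue_try_push(int io)
{
    if (queued >= QUEUE_SIZE)
        return false;           /* full: caller must handle the IO itself */
    queue[queued++] = io;
    return true;
}

static void
submit_io(int io)
{
    if (queue_try_push(io))
        printf("io %d handed to the worker pool\n", io);
    else
        printf("io %d executed synchronously by the submitter\n", io);
}

int
main(void)
{
    for (int io = 0; io < 6; io++)
        submit_io(io);
    return 0;
}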

As for the equivalent of pool sizing inside io_uring (and maybe other
AIO systems in other kernels), hmm... in the absolute best cases
worker threads can be skipped completely, e.g. for direct I/O queued
straight to the device, but when they are used I guess they have
pretty different economics. A kernel can start a thread just by
allocating a bit of memory and sticking it in a queue, and can also
wake one (move it to a different scheduler queue) cheaply, but we have
to fork a giant process that has to open all the files and build up
its caches etc. So I think they just start threads immediately on
demand without damping, with some kind of short grace period to avoid
repeating even those smaller costs. I'm no expert on those internal
details, but our worker system clearly needs all this damping and
steady-state discovery due to the higher overheads and sloppy
wakeups.

Thinking more about our comparatively heavyweight I/O workers, there
must also be affinity opportunities. If you somehow tended to use the
same workers for a given database in a cluster with multiple active
databases, then workers might accumulate fewer open file descriptors
and SMgrRelation cache objects. If you had per-NUMA node pools and
queues then you might be able to reduce contention, and maybe also
cache line ping-pong on buffer headers considering that the submitter
dirties the header, then the worker does (in the completion callback),
and then the submitter accesses it again. I haven't investigated
that.

> As a side note, I was trying to experiment with this patch using
> dm-mapper's delay feature to introduce an arbitrary large io latency and
> see how the io queue is growing. But strangely enough, even though the
> pure io latency was high, the queue growth was smaller than e.g. on a
> real hardware under the same conditions without any artificial delay. Is
> there anything obvious I'm missing that could have explained that?

Could it be that the queue alternates between full and almost empty
because of method_worker.c's fallback to synchronous execution on
overflow, which slows submission down, so that you end up plotting an
average depth that is lower than you expected? With the patch I'll
share shortly to make pg_aios show a useful f_sync value, it might be
more obvious...

About dm-mapper delays, I actually found it useful to hack up worker
mode itself to simulate storage behaviours, for example swamped local
disks or cloud storage with deep queues and no back pressure but
artificial IOPS and bandwidth caps, etc. I was thinking about
developing some proper settings to help with that kind of research:
debug_io_worker_queue_size (changeable at runtime),
debug_io_max_worker_queue_size (allocated at startup),
debug_io_worker_{latency,bandwidth,iops} to introduce calculated
sleeps, and debug_io_worker_overflow_policy=synchronous|wait so that
you can disable the synchronous fallback that confuses matters.
That'd be more convenient, portable and flexible than dm-mapper tricks
I guess. I'd been imagining that as a tool to investigate higher
level work on feedback control for read_stream.c as mentioned, but
come to think of it, it could also be useful to understand things
about the worker pool itself. That's vapourware though; for myself I
just used dirty hacks last time I was working on that stuff. In other
words, patches are most welcome if you're interested in that kind of
thing. I am a bit tied up with multithreading at the moment and time
grows short. I will come back to that problem in a little while and
that patch is on my list as part of the infrastructure needed to prove
things about the I/O stream feedback work I hope to share later...
