Re: Automatically sizing the IO worker pool

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Automatically sizing the IO worker pool
Date: 2026-04-07 19:01:32
Message-ID: adaz4sd2x64uoei2nmd4pinvothusd7uwllnilq6msfp4wytrw@npedbcvg3llt
Lists: pgsql-hackers

Hi,

On 2026-04-07 22:39:37 +1200, Thomas Munro wrote:

> > > @@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
> > >
> > > return seconds * 1000;
> > > }
> > > - else
> > > - return 60 * 1000;
> > > }
> > >
> > > - if (StartWorkerNeeded)
> > > + /* Time of next maybe_start_io_workers() call, or 0 for none. */
> > > + next_wakeup = maybe_start_io_workers_scheduled_at();
> > > +
> > > + /* Ignore bgworkers during shutdown. */
> > > + if (StartWorkerNeeded && Shutdown == NoShutdown)
> > > return 0;
> >
> > Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
> > here?
>
> Seems OK? I mean, sure, I would like to make this whole function more
> uniform in structure, see my second patch, but...

It's OK, there just doesn't seem to be a point in doing it before that if
rather than just after it...

> > > +static int
> > > +pgaio_worker_set_get_highest(PgAioWorkerSet *set)
> > > +{
> > > + Assert(!pgaio_worker_set_is_empty(set));
> > > + return pg_leftmost_one_pos64(*set);
> > > +}
> >
> > "worker_set_get*" reads quite awkwardly. Maybe just going for
> > pgaio_workerset_* would help?
> >
> > Or maybe just name it PgAioWset/pgaio_wset_ or such?
>
> OK let's try "workerset".

Looks better.

> > Maybe just name it pgaio_worker_set_grow()?
>
> OK how about:
>
> pgaio_worker_request_grow()
> pgaio_worker_cancel_grow()

WFM.

> > > @@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > break;
> > > }
> > >
> > > - if (wakeup == NULL)
> > > - {
> > > - /* Choose an idle worker to wake up if we haven't already. */
> > > - worker = pgaio_worker_choose_idle();
> > > - if (worker >= 0)
> > > - wakeup = io_worker_control->workers[worker].latch;
> > > -
> > > - pgaio_debug_io(DEBUG4, staged_ios[i],
> > > - "choosing worker %d",
> > > - worker);
> > > - }
> > > + /* Choose one worker to wake for this batch. */
> > > + if (worker == -1)
> > > + worker = pgaio_worker_choose_idle(0);
> > > }
> >
> > If we only want to do this once per "batch", why not just do it outside the
> > num_staged_ios loop?
>
> Two steps: pgaio_worker_choose_idle() must be done while holding the
> queue lock (will probably finish up revising this in future work on
> removing locks...). pgaio_worker_wake() is called outside the loop,
> after releasing the lock.

I just meant doing it outside the for loop.

    for (int i = 0; i < num_staged_ios; ++i)
    {
        Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
        if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
        {
            /*
             * If the queue is full, give up and do the rest synchronously.
             * We're holding an exclusive lock on the queue so nothing can
             * consume entries.
             */
            synchronous_ios = &staged_ios[i];
            nsync = (num_staged_ios - i);

            break;
        }

        /* Choose one worker to wake for this batch. */
        if (worker == -1)
            worker = pgaio_worker_choose_idle(-1);
    }

The if (worker == -1) check is done for every IO to be submitted. If there are
no idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it
should just be:

    for (int i = 0; i < num_staged_ios; ++i)
    {
        Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
        if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
        {
            /*
             * If the queue is full, give up and do the rest synchronously.
             * We're holding an exclusive lock on the queue so nothing can
             * consume entries.
             */
            synchronous_ios = &staged_ios[i];
            nsync = (num_staged_ios - i);

            break;
        }
    }

    /* Choose one worker to wake for this batch. */
    if (worker == -1)
        worker = pgaio_worker_choose_idle(-1);

> > > @@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > static void
> > > pgaio_worker_die(int code, Datum arg)
> > > {
> > > [...]
> > > + /* Notify other workers on pool change. */
> >
> > Why are we notifying them on pool changes?
>
> Comments added to explain. It closes a wakeup-loss race (imagine if
> you consumed a wakeup while you were exiting due to timeout; no one
> else would wake up, which I fixed with this big hammer).

Thanks, looks a lot clearer now.

> > > +/*
> > > + * Check if this backend is allowed to time out, and thus should use a
> > > + * non-infinite sleep time. Only the highest-numbered worker is allowed to
> > > + * time out, and only if the pool is above io_min_workers. Serializing
> > > + * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
> > > + * io_min_workers.
> >
> > But it's ok if a lower numbered worker errors out, right? There will be a
> > temporary gap, but we will start a new worker for it?
>
> Yes it is OK for there to be gaps.
>
> If any worker errors out, it will be replaced when reaped if we fell
> below io_min_workers, and otherwise replaced via the usual means, ie
> once the backlog detection and the launch delay allow it. I did have
> a version that always replaced *every* worker with exit code 1
> immediately, but I started wondering if we really want persistent
> errors to turn into high speed fork() loops. I'm still not sure TBH.
> We don't expect workers to error out, so it means something is already
> pretty screwed up and you might appreciate the rate limiting?

Yea, I think it's saner not to do that.

> > I think both "wakeups" and "ios" are a bit too generically named. Based on the
> > names I have no idea what this heuristic might be.
>
> I have struggled to name them. Does wakeup_count and io_count help?

hist_wakeups, hist_ios?

> > > + /*
> > > + * If there were no idle higher numbered peers and there
> > > + * are more than enough IOs queued for me and all lower
> > > + * numbered peers, then try to start a new worker.
> > > + */
> > > + if (worker == -1 && queue_depth > MyIoWorkerId)
> > > + grow = true;
> > > + }
> >
> > We probably shouldn't request growth when already at the cap? That could
> > generate a *lot* of pmsignal traffic, I think?
>
> No, we only set it if it isn't already set (like a latch), and only
> send a pmsignal when we set it (like a latch), and the postmaster only
> clears it if it can start a worker (unlike a latch). That applies in
> general, not just when we hit the cap of io_max_workers: while the
> postmaster is waiting for the launch interval to expire, it will leave the
> flag set, suppressed for 100ms or whatever, and in the special case of
> io_max_workers, for as long as the count remains that high.

I'm quite certain that's not how it actually ended up working with the prior
version and the benchmark I showed, there indeed were a lot of requests to
postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old
name already) very frequently clears the flag, just for it to be immediately
set again.

Yep, still happens, does require the max to be smaller than 32 though.

While a lot of IO is happening and no new connections are being started, with
1781562 being the postmaster's pid:

perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1

Performance counter stats for process id '1781562':

2,790 raw_syscalls:sys_enter

1.001872667 seconds time elapsed

2,814 raw_syscalls:sys_enter

1.001983049 seconds time elapsed

3,036 raw_syscalls:sys_enter

1.001705850 seconds time elapsed

2,982 raw_syscalls:sys_enter

1.001881364 seconds time elapsed

I think it may need a timestamp in the shared state to not allow another
postmaster wake until some time has elapsed, or something.

>
> I should have made it clearer that that's a secondary condition. The
> primary condition is: a worker wanted to wake another worker, but
> found that none were idle. Unfortunately the whole system is a bit
> too asynchronous for that to be a reliable cue on its own. So, I also
> check if the queue appears to be (1) obviously growing: that's clearly
> too long and must be introducing latency, or (2) varying "too much".
> Which I detect in exactly the same way.
>
> Imagine a histogram that looks like this:
>
> LOG: depth 00: 7898
> LOG: depth 01: 1630
> LOG: depth 02: 308
> LOG: depth 03: 93
> LOG: depth 04: 40
> LOG: depth 05: 19
> LOG: depth 06: 6
> LOG: depth 07: 4
> LOG: depth 08: 0
> LOG: depth 09: 1
> LOG: depth 10: 1
> LOG: depth 11: 0
> LOG: depth 12: 0
> LOG: depth 13: 0
>
> If you're failing to find idle workers to wake up AND our magic
> threshold is hit by something in that long tail, then it'll call for
> backup. Of course I'm totally sidestepping a lot of queueing theory
> maths and just saying "I'd better be able to find an idle worker when
> I want to" and if not, "there had better not be any outliers that
> reach this far".
>
> I've written a longer explanation in a long comment. Including a
> little challenge for someone to do better with real science and maths.
> I hope it's a bit clearer at least.

Definitely good to have that comment. Have to ponder it for a bit.

> > ninja install-test-files
> > io_max_workers=32
> > debug_io_direct=data
> > effective_io_concurrency=16
> > shared_buffers=5GB
> >
> > pgbench -i -q -s 100 --fillfactor=30
> >
> > CREATE EXTENSION IF NOT EXISTS test_aio;
> > CREATE EXTENSION IF NOT EXISTS pg_buffercache;
> > DROP TABLE IF EXISTS pattern_random_pgbench;
> > CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;
> >
> > My test is:
> >
> > SET effective_io_concurrency = 20;
> > SELECT pg_buffercache_evict_relation('pgbench_accounts');
> > SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
> >
> >
> > We end up with ~24-28 workers, even though we never have more than 20 IOs in
> > flight. Not entirely sure why. I guess it's just that after doing an IO the
> > worker needs to mark itself idle etc?
>
> Yep. It would be nice to make it a bit more accurate in later cycles.
> It tends to overprovision rather than under, since it thinks all other
> workers are busy.

I think that's the right direction to err into.

> That information is a bit racy.

Yea, I think that's fine.

> > Hm. This way you get very rapid worker pool reductions. Configured
> > io_worker_idle_timeout=1s, started a bunch of work and observed the worker
> > count after the work finishes:
> > ...
> > Of course this is a ridiculously low setting, but it does seem like starting
> > the timeout even when not the highest numbered worker will lead to a lot of
> > quick yoyoing.
>
> I have changed it so that after one worker times out, the next one
> begins its timeout count from 0. (This is one of the reasons for that
> "notify the whole pool when I exit" thing.)

That looks much better in a quick test.

I've not again looked through the details, but based on a relatively short
experiment, the one problematic thing I see is the frequent postmaster
requests.

Greetings,

Andres Freund
