Re: Automatically sizing the IO worker pool

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Automatically sizing the IO worker pool
Date: 2026-04-07 23:18:51
Message-ID: CA+hUKGJi6L2ftuTiKw2W4AAah6Uivs4JGiq58A7MjePbXpc4BA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> The if (worker == -1) check is done for every to-be-submitted IO. If there are no
> idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it
> should just be:
>
>     for (int i = 0; i < num_staged_ios; ++i)
>     {
>         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
>         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
>         {
>             /*
>              * If the queue is full, give up and do the rest
>              * synchronously.  We're holding an exclusive lock on the
>              * queue so nothing can consume entries.
>              */
>             synchronous_ios = &staged_ios[i];
>             nsync = (num_staged_ios - i);
>
>             break;
>         }
>     }
>
>     /* Choose one worker to wake for this batch. */
>     if (worker == -1)
>         worker = pgaio_worker_choose_idle(-1);

Well, I didn't want to wake a worker if we'd failed to enqueue
anything.  Ah, I could put it there and test nsync.  Or I guess I
could just do it anyway.  Considering that.

> > > I think both 'wakeups" and "ios" are a bit too generically named. Based on the
> > > names I have no idea what this heuristic might be.
> >
> > I have struggled to name them. Does wakeup_count and io_count help?
>
> hist_wakeups, hist_ios?

Thanks, that's a good name.

> > No, we only set it if it isn't already set (like a latch), and only
> > send a pmsignal when we set it (like a latch), and the postmaster only
> > clears it if it can start a worker (unlike a latch). That applies in
> > general, not just when we hit the cap of io_max_workers: while the
> > postmaster is waiting for launch interval to expire, it will leave the
> > flag set, suppressed for 100ms or whatever, and in the special
> > case of io_max_workers, for as long as the count remains that high.
>
> I'm quite certain that's not how it actually ended up working with the prior
> version and the benchmark I showed, there indeed were a lot of requests to
> postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old
> name already) very frequently clears the flag, just for it to be immediately
> set again.
>
>
> Yep, still happens, does require the max to be smaller than 32 though.
>
> While a lot of IO is happening, no new connections being started, and with
> 1781562 being postmaster's pid:
>
> perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
>
> Performance counter stats for process id '1781562':
>
> 2,790 raw_syscalls:sys_enter
>
> 1.001872667 seconds time elapsed
>
> 2,814 raw_syscalls:sys_enter
>
> 1.001983049 seconds time elapsed
>
> 3,036 raw_syscalls:sys_enter
>
> 1.001705850 seconds time elapsed
>
> 2,982 raw_syscalls:sys_enter
>
> 1.001881364 seconds time elapsed
>
>
> I think it may need a timestamp in the shared state to not allow another
> postmaster wake until some time has elapsed, or something.

Hnng. Studying...

> > I should have made it clearer that that's a secondary condition. The
> > primary condition is: a worker wanted to wake another worker, but
> > found that none were idle. Unfortunately the whole system is a bit
> > too asynchronous for that to be a reliable cue on its own. So, I also
> > check if the queue appears to be (1) obviously growing: that's clearly
> > too long and must be introducing latency, or (2) varying "too much".
> > Which I detect in exactly the same way.
> >
> > Imagine a histogram that looks like this:
> >
> > LOG: depth 00: 7898
> > LOG: depth 01: 1630
> > LOG: depth 02: 308
> > LOG: depth 03: 93
> > LOG: depth 04: 40
> > LOG: depth 05: 19
> > LOG: depth 06: 6
> > LOG: depth 07: 4
> > LOG: depth 08: 0
> > LOG: depth 09: 1
> > LOG: depth 10: 1
> > LOG: depth 11: 0
> > LOG: depth 12: 0
> > LOG: depth 13: 0
> >
> > If you're failing to find idle workers to wake up AND our magic
> > threshold is hit by something in that long tail, then it'll call for
> > backup. Of course I'm totally sidestepping a lot of queueing theory
> > maths and just saying "I'd better be able to find an idle worker when
> > I want to" and if not, "there had better not be any outliers that
> > reach this far".
> >
> > I've written a longer explanation in a long comment. Including a
> > little challenge for someone to do better with real science and maths.
> > I hope it's a bit clearer at least.
>
> Definitely good to have that comment. Have to ponder it for a bit.

Let me try again.

Our goal is simple: process every IO immediately.  We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker.  The only action we can take is simple: add one more
worker.  So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs that would be required*.

The problem is that on its own, the test triggered far too easily,
because on the one hand a worker that is not marked idle might in fact
be just about to pick up that IO, and on the other hand there might be
rare spikes/clustering.  So I cooled it off a bit by additionally
testing whether the queue appears to be growing or spiking beyond some
threshold.  I think it's OK to let the queue grow a bit before we are
triggered anyway, so the precise value used doesn't seem too critical.
Someone might be able to come up with a more defensible value, but in
the end I just wanted a value that isn't triggered by the outliers I
see in real systems that are keeping up.  We could tune it lower and
overshoot more, but this setting seems to work pretty well.  It
doesn't seem likely that a real system could achieve a steady state
that is introducing latency but isn't increasing over time, and pool
size adjustments are bound to lag anyway.

* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!

> I've not again looked through the details, but based on a relatively short
> experiment, the one problematic thing I see is the frequent postmaster
> requests.

Looking into that...
