From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Should io_method=worker remain the default?
Date: 2025-09-03 19:31:38
Message-ID: fgmc4pes6d3rpcdgojxwvq7uwiqxthrneoh5zjtx2uda3tyty2@qwfvvb6z3b5p
Lists: pgsql-hackers
Hi,
On 2025-09-03 11:50:05 -0700, Jeff Davis wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > 32 parallel seq scans of a large relation, with default shared
> > buffers, fully cached in the OS page cache, seems like a pretty
> > absurd workload.
>
> It's the default settings, and users often just keep going with the
> defaults as long as it works, giving little thought to any kind of
> tuning or optimization until they hit a wall. Fully cached data is
> common, as are scan-heavy workloads. Calling it "absurd" is an
> exaggeration.
I agree that an unconfigured postgres is a common thing. But I don't think
that means that doing 30GB/s of IO from 32 backends is something that
workloads using an untuned postgres will do. I cannot come up with any
halfway realistic scenario where you'd do anywhere near this much cached IO.
As soon as you don't do quite as much IO, the entire bottleneck the test
exercises *completely* vanishes. There's like 20-30 instructions covered by
that lwlock. You need a *lot* of invocations for that to become a bottleneck.
> > That's not to say we shouldn't spend some effort to avoid regressions for
> > it, but it also doesn't seem to be worth focusing all that much on it.
>
> Fair, but we should acknowledge the places where the new defaults do better
> vs worse, and provide some guidance on what to look for and how to tune it.
If you find a describable, realistic workload that regresses, I'm all ears. Both
from the perspective of fixing regressions and documenting how to choose the
correct io_method. But providing tuning advice for a very extreme workload
that only conceivably exists with an untuned postgres doesn't seem likely to
help anybody.
> We should also not be in too much of a rush to get rid of "sync" mode until
> we have a better idea about where the tradeoffs are.
Nobody is in a rush to do so, from what I can tell? I don't know why you're so
focused on this.
> > Or is there a real-world scenario this actually emulating?
>
> This test was my first try at reproducing a smaller (but still
> noticeable) regression seen on a more realistic benchmark. I'm not 100%
> sure whether I reproduced the same effect or a different one, but I
> don't think we should dismiss it so quickly.
From what I can tell here, the regression that this benchmark observes is
entirely conditional on *extremely* high volumes of IO being issued by a lot
of backends.
What was the workload that hit the smaller regression?
> > *If* we actually care about this workload, we can make
> > pgaio_worker_submit_internal() acquire that lock conditionally, and
> > perform
> > the IOs synchronously instead.
>
> I like the idea of some kind of fallback for multiple reasons. I
> noticed that if I set io_workers=1, and then I SIGSTOP that worker,
> then sequential scans make no progress at all until I send SIGCONT. A
> fallback to synchronous sounds more robust, and more similar to what we
> do with walwriter and bgwriter. (That may be 19 material, though.)
I don't think what I'm proposing would really change anything in such a
scenario, unless you were unlucky enough to send the SIGSTOP in the very short
window in which the worker held the lwlock.
We already fall back to synchronous IO when the queue towards the IO workers
is full. But I don't see a way to identify the case of "the queue is not full,
but the worker isn't making enough progress".
> > But I'm really not sure doing > 30GB/s of repeated reads from the
> > page cache
> > is a particularly useful thing to optimize.
>
> A long time ago, the expectation was that Postgres might be running on
> a machine along with other software, and perhaps many instances of
> Postgres on the same machine. In that case, low shared_buffers compared
> with the overall system memory makes sense, which would cause a lot of
> back-and-forth into shared buffers. That was also the era of magnetic
> disks, where such memory copies seemed almost free by comparison --
> perhaps we just don't care about that case any more?
I think we should still care about performing reasonably when data primarily
is cached in the OS page cache. I just don't think the benchmark is a good
example of such workloads - if you need to do data analysis of > 30GB of
data per second, while the data is also cached, you can configure postgres at
least somewhat reasonably. Even if you were to somehow analyze this much
data, any realistic workload will actually do more than just count(*), which
means AioWorkerSubmissionQueueLock won't be the bottleneck.
> > If I instead just increase s_b, I get 2x the throughput...
>
> Increase to what? I tried a number of settings. Obviously >32GB makes
> it a non-issue because everything is cached. Values between 128MB and
> 32GB didn't seem to help, and were in some cases lower, but I didn't
> look into why yet. It might have something to do with crowding out the
> page cache.
I meant above ~32GB.
Greetings,
Andres Freund