Re: AIO v2.5

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Subject: Re: AIO v2.5
Date: 2025-07-14 18:36:49
Message-ID: brdaw5wke274lubirrl4v2k4qdacylvgwwqztifn7m27pkth3s@rh7wie47pfcp
Lists: pgsql-hackers

Hi,

On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
> I've been running some benchmarks comparing the io_methods, to help with
> resolving this PG18 open item. So here are some results, and my brief
> analysis of it.

Thanks for doing that!

> The TL;DR version
> -----------------
>
> * The "worker" method seems good, and I think we should keep it as a
> default. We should probably think about increasing the number of workers
> a bit, the current io_workers=3 seems to be too low and regresses in a
> couple tests.
>
> * The "sync" seems OK too, but it's more of a conservative choice, i.e.
> more weight for keeping the PG17 behavior / not causing regressions. But
> I haven't seen that (with enough workers). And there are cases when the
> "worker" is much faster. It'd be a shame to throw away that benefit.
>
> * There might be bugs in "worker", simply because it has to deal with
> multiple concurrent processes etc. But I guess we'll fix those just like
> other bugs. I don't think it's a good argument against "worker" default.
>
> * All my tests were done on Linux and NVMe drives. It'd be good to do
> similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
> to do some of that, but it'd be great to cover more cases. I can help
> with getting my script running, a run takes ~1-2 days.

FWIW, in my very limited tests on Windows, the benefit of worker was
considerably bigger, due to Windows having much more minimal readahead and
not having posix_fadvise...

> The test also included PG17 for comparison, but I forgot PG18 enabled
> checksums by default. So PG17 results are with checksums off, which in
> some cases means PG17 seems a little bit faster. I'm re-running it with
> checksums enabled on PG17, and that seems to eliminate the differences
> (as expected).

My sneaking suspicion is that, independent of AIO, we're not really ready for
checksums to default to on.

> Findings
> --------
>
> I'm attaching only three PDFs with charts from the cold runs, to keep
> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
> PDFs in the git repository, but it's all very similar and the attached
> PDFs are quite representative.
>
> Some basic observations:
>
> a) index scans
>
> There's almost no difference for indexscans, i.e. the middle column in
> the PDFs. There's a bit of variation on some of the cyclic/linear data
> sets, but it seems more like random noise than a systemic difference.
>
> Which is not all that surprising, considering index scans don't really
> use read_stream yet, so there's no prefetching etc.

Indeed.

> The "ryzen" results however demonstrate that 3 workers may be too low.
> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
> dropping back to ~1000ms. The other datasets show similar difference.
> With 12 workers, there's no such problem.

I don't really know what to do about that - for now we don't have dynamic
#workers, and starting 12 workers on a tiny database doesn't really make
sense... I suspect that on most hardware & queries it won't matter that much,
but clearly, if you have high iops hardware it might. I can perhaps see
increasing the default to 5 or so, but after that... I guess we could try
some autoconf formula based on the size of s_b or such? But that seems
somewhat awkward too.
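
As a purely illustrative sketch of what such a formula could look like
(nothing like this exists in PostgreSQL; the function name, the
one-worker-per-GB ratio, and the clamp between 3 and 12 workers are all
assumptions made up for the example, the bounds being borrowed from the
current default and the 12-worker runs above):

```python
# Hypothetical heuristic, NOT something PostgreSQL implements: derive a
# default io_workers value from the size of shared_buffers.
def suggest_io_workers(shared_buffers_mb: int,
                       min_workers: int = 3,
                       max_workers: int = 12,
                       mb_per_worker: int = 1024) -> int:
    """Return one worker per mb_per_worker of shared_buffers, clamped."""
    if shared_buffers_mb < 0:
        raise ValueError("shared_buffers_mb must be non-negative")
    # Assumed ratio: one worker per GB of shared_buffers.
    scaled = shared_buffers_mb // mb_per_worker
    # Never go below the current default or above the assumed cap.
    return max(min_workers, min(max_workers, scaled))
```

With these made-up constants a tiny instance keeps the current default of 3,
and only larger shared_buffers settings scale up towards the cap, which also
shows the awkward part: s_b says little about how many iops the storage can
actually do.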

>
> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>
> There's an interesting difference I noticed in the run with
> checksums on PG17. The full PDF is available here:

(there's a subsequent email about this, will reply there)

> Conclusion
> ----------
>
> That's all I have at the moment. I still think it makes sense to keep
> io_method=worker, but bump up the io_workers a bit higher. Could we also
> add some suggestions how to pick a good value to the docs?

.oO(/me ponders a troll patch to re-add a reference to the number of spindles
in that tuning advice)

I'm not sure what advice to give here. Maybe just to set it to a considerably
larger number once not running on a tiny system? The incremental overhead of
having an idle worker is rather small unless you're on a really tiny system...

> You might also run the benchmark on different hardware, and either
> build/publish the plots somewhere, or just give me the CSV and I'll do
> that. Better to find strange stuff / regressions now.

Probably the most interesting thing would be some runs with cloud-ish storage
(relatively high iops, very high latency)...

> The repository also has branches with plots showing results with WIP
> indexscan prefetching. (It's excluded from the PDFs I presented here).

Hm, I looked for those, but I couldn't quickly find any plots that include
them. Would I have to generate the plots from a checkout of the repo?

> The conclusions are similar to what we found here - "worker" is good
> with enough workers, io_uring is good too. Sync has issues for some of
> the data sets, but still helps a lot.

Nice.

Greetings,

Andres Freund
