| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> | 
| Cc: | Antonin Houska <ah(at)cybertec(dot)at>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> | 
| Subject: | Re: AIO v2.5 | 
| Date: | 2025-07-14 18:36:49 | 
| Message-ID: | brdaw5wke274lubirrl4v2k4qdacylvgwwqztifn7m27pkth3s@rh7wie47pfcp | 
| Lists: | pgsql-hackers | 
Hi,
On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote:
> I've been running some benchmarks comparing the io_methods, to help with
> resolving this PG18 open item. So here are some results, and my brief
> analysis of it.
Thanks for doing that!
> The TL;DR version
> -----------------
> 
> * The "worker" method seems good, and I think we should keep it as a
> default. We should probably think about increasing the number of workers
> a bit, the current io_workers=3 seems to be too low and regresses in a
> couple tests.
> 
> * The "sync" seems OK too, but it's more of a conservative choice, i.e.
> more weight for keeping the PG17 behavior / not causing regressions. But
> I haven't seen that (with enough workers). And there are cases when the
> "worker" is much faster. It'd be a shame to throw away that benefit.
> 
> * There might be bugs in "worker", simply because it has to deal with
> multiple concurrent processes etc. But I guess we'll fix those just like
> other bugs. I don't think it's a good argument against "worker" default.
> 
> * All my tests were done on Linux and NVMe drives. It'd be good to do
> similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
> to do some of that, but it'd be great to cover more cases. I can help
> with getting my script running, a run takes ~1-2 days.
FWIW, in my very limited tests on Windows, the benefit of worker was
considerably bigger there, due to Windows having much more minimal readahead
and not having posix_fadvise...
> The test also included PG17 for comparison, but I forgot PG18 enabled
> checksums by default. So PG17 results are with checksums off, which in
> some cases means PG17 seems a little bit faster. I'm re-running it with
> checksums enabled on PG17, and that seems to eliminate the differences
> (as expected).
My sneaking suspicion is that, independent of AIO, we're not really ready for
checksums defaulting to on.
> Findings
> --------
> 
> I'm attaching only three PDFs with charts from the cold runs, to keep
> the e-mail small (each PDF is ~100-200kB). Feel free to check the other
> PDFs in the git repository, but it's all very similar and the attached
> PDFs are quite representative.
> 
> Some basic observations:
> 
> a) index scans
> 
> There's almost no difference for indexscans, i.e. the middle column in
> the PDFs. There's a bit of variation on some of the cyclic/linear data
> sets, but it seems more like random noise than a systemic difference.
>
> Which is not all that surprising, considering index scans don't really
> use read_stream yet, so there's no prefetching etc.
Indeed.
> The "ryzen" results however demonstrate that 3 workers may be too low.
> The timing spikes to ~3000ms (at ~1% selectivity), before quickly
> dropping back to ~1000ms. The other datasets show similar difference.
> With 12 workers, there's no such problem.
I don't really know what to do about that - for now we don't have dynamic
#workers, and starting 12 workers on a tiny database doesn't really make
sense...  I suspect that on most hardware & queries it won't matter that much,
but clearly, if you have high iops hardware it might.  I can perhaps see
increasing the default to 5 or so, but after that...  I guess we could try
some autoconf formula based on the size of s_b or such? But that seems
somewhat awkward too.
> 
> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
> 
> There's an interesting difference I noticed in the run with
> checksums on PG17. The full PDF is available here:
(there's a subsequent email about this, will reply there)
> Conclusion
> ----------
> 
> That's all I have at the moment. I still think it makes sense to keep
> io_method=worker, but bump up the io_workers a bit higher. Could we also
> add some suggestions how to pick a good value to the docs?
.oO(/me ponders a troll patch to re-add a reference to the number of spindles in
that tuning advice)
I'm not sure what advice to give here.  Maybe just to set it to a considerably
larger number once not running on a tiny system? The incremental overhead of
having an idle worker is rather small unless you're on a really tiny system...
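Purely as an illustration of what "a considerably larger number" could look
like in postgresql.conf (the value 12 is just the larger worker count from the
benchmark quoted above, an assumption for the example and not a tuned
recommendation):

```
# postgresql.conf (illustrative sketch only)
io_method = worker     # the method discussed in this thread
io_workers = 12        # assumed value, taken from the quoted 12-worker runs; default is 3
```

What is appropriate will depend on the storage and the workload.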
> You might also run the benchmark on different hardware, and either
> build/publish the plots somewhere, or just give me the CSV and I'll do
> that. Better to find strange stuff / regressions now.
Probably the most interesting thing would be some runs with cloud-ish storage
(relatively high iops, very high latency)...
> The repository also has branches with plots showing results with WIP
> indexscan prefetching. (It's excluded from the PDFs I presented here).
Hm, I looked for those, but I couldn't quickly find any plots that include
them.  Would I have to generate the plots from a checkout of the repo?
> The conclusions are similar to what we found here - "worker" is good
> with enough workers, io_uring is good too. Sync has issues for some of
> the data sets, but still helps a lot.
Nice.
Greetings,
Andres Freund