From: wenhui qiu <qiuwenhuifx(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 04:18:56
Message-ID: CAGjGUALdn-K3gks_kDUZQhY+_5CbVfYyaWz-8fmMXn303O4XiQ@mail.gmail.com
Lists: pgsql-hackers
Hi Tomas
I couldn't agree more. Many default values are just too conservative, and
the documentation doesn't provide best practices. I think we should reduce
it to 1.x, or add a tip to the documentation providing recommended values
for different SSDs.
On Mon, 6 Oct 2025 at 08:59, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Hi,
>
> I wonder if it's time to consider updating the random_page_cost default
> value. There are multiple reasons to maybe do that.
>
> The GUC (and the 4.0 default) was introduced in ~2000 [1], so ~25 years
> ago. During that time the world went from rotational drives through
> multiple generations of flash / network-attached storage. I find it hard
> to believe those changes wouldn't affect random_page_cost.
>
> And indeed, the common advice is to reduce the GUC closer to 1.0 on
> SSDs. I myself recommended doing that in the past, but over time I got
> somewhat skeptical about it. The advice is based on the "obvious" wisdom
> that SSDs are much better in handling random I/O than rotational disks.
>
> But this has two flaws. First, it assumes the current 4.0 default makes
> sense - maybe it doesn't and then it's useless as a "starting point".
> Second, it's not obvious how much better SSDs are without concurrent I/O
> (which is needed to fully leverage them) - something we don't do for
> index scans (yet), and even if we did, the cost model has no concept of
> such concurrency.
>
> Recently, I've been doing some experiments evaluating how often we pick
> an optimal scan for simple select queries, assuming accurate estimates.
> Turns out we pick the wrong plan fairly often, even with almost perfect
> estimates. I somewhat expected that, with the default random_page_cost
> value. What did surprise me was that to improve the plans, I had to
> *increase* the value, even on really new/fast SSDs ...
>
> I started looking at how we calculated the 4.0 default back in 2000.
> Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
> But he outlined how the experiment worked:
>
> - generate large table (much bigger than RAM)
> - measure runtime of seq scan
> - measure runtime of full-table index scan
> - calculate how much more expensive a random page access is
>
> So I decided to try doing this on a couple of different devices and see
> what random_page_cost values that gives me. Attached is a script doing
> such a benchmark:
>
> (1) initializes a new cluster, with a couple parameters adjusted
>
> (2) creates a random table (with uniform distribution)
>
> (3) runs a sequential scan
>
> SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
> (4) runs an index scan
>
> SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;
>
> The script does a couple of things to force each query plan (sketched
> below), and it reports the timings of the two queries at the end.
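>
> Roughly, forcing the two plans boils down to something like this (just a
> sketch - the attached script is what actually runs, and the exact
> enable_* toggles here are illustrative):
>
>    -- sequential scan (index access disabled)
>    SET enable_indexscan = off;
>    SET enable_bitmapscan = off;
>    SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
>    -- index scan (seqscan disabled, ORDER BY id to read through the index)
>    SET enable_indexscan = on;
>    SET enable_bitmapscan = off;
>    SET enable_seqscan = off;
>    SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;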
>
> I've been running this on machines with 64GB of RAM, so I chose the
> table to have 500M rows. With fillfactor=20 that means a ~182GB table
> (23809524 pages).
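>
> (That is 23809524 pages * 8kB ~= 182GB, i.e. about 21 rows per page
> thanks to the low fillfactor.)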
>
> Let's say that
>
> T(seq) = timing of the seqscan query
> T(idx) = timing of the index scan
> IOS(seq) = number of sequential page reads
> IOS(idx) = number of random page reads
> P = number of pages
> N = number of rows
>
> then time to read a sequential page (because IOS(seq) == P)
>
> PT(seq) = T(seq) / IOS(seq) = T(seq) / P
>
> and time to read a random page (assuming the table is perfectly random,
> with no cache hits):
>
> PT(idx) = T(idx) / IOS(idx) = T(idx) / N
>
> which gives us the "ideal" random page cost (as a value relative to
> reading a page sequentially)
>
> random_page_cost = PT(idx) / PT(seq)
>
>                      T(idx) * P
>                   = --------------
>                      T(seq) * N
>
> The "no cache hits" is not quite correct, with 182GB there's about 30%
> of a page being in memory. I didn't think of using debug_io_direct at
> the time, but it doesn't affect the conclusion very much (in fact, it
> would *increase* the random_page_cost value, which makes it worse).
>
> I did this on the three SSDs I use for testing - 4x NVMe RAID0, 4x SATA
> RAID0, and a single NVMe drive. Here's the results:
>
>                 seqscan (s)    index scan (s)    random_page_cost
>   ----------------------------------------------------------------
>   NVMe                   98             42232                20.4
>   NVMe/RAID0             24             25462                49.3
>   SATA/RAID0            109             48141                21.0
>
> These are reasonably good SSDs, and yet the "correct" random_page_cost
> comes out about 5-10x of our default.
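>
> As a sanity check, plugging the single-NVMe numbers into the formula
> above (T(idx) = 42232, T(seq) = 98, P = 23809524, N = 500000000):
>
>    random_page_cost = (42232 * 23809524) / (98 * 500000000) ~= 20.5
>
> which matches the 20.4 in the table, give or take rounding of the
> timings.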
>
> FWIW I double checked the test is actually I/O bound. The CPU usage
> never goes over ~50% (not even in the seqscan case).
>
> These calculated values also align with the "optimal" plan choice, i.e.
> the plans flip much closer to the actual crossing point (compared to
> where they'd flip with 4.0).
>
> It obviously contradicts the advice to set the value closer to 1.0. But
> why is that? SSDs are certainly better at random I/O, even if the I/O
> is not concurrent and the SSD is not fully utilized. So the 4.0 seems
> off - for rotational disks the value should be even higher than what we
> got for the SSDs ...
>
> I don't have any rotational devices in my test machines anymore, but I
> got an azure VM with local "SCSI" disk, and with "standard HDD" volume.
> And I got this (this is with 10M rows, ~3.7GB, with direct I/O):
>
>                 seqscan (s)    index scan (s)    random_page_cost
>   ----------------------------------------------------------------
>   SCSI                  2.1              1292                28.5
>   standard HDD        209.8             27586                62.6
>
> I suspect the SCSI disk is not actually rotational - it seems more like
> an SSD behind a SCSI interface, though I don't know what model it
> actually is. The
> "standard HDD" seems much closer to rotational, with ~370 IOPS (it'd
> take ages to do the index scan on more than 10M rows).
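>
> (The ~370 IOPS also lines up with the measured runtime: 10M random page
> reads at ~370 reads/s is about 10000000 / 370 ~= 27000 seconds, close to
> the 27586 s index scan above.)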
>
> Unless I made some silly mistake, these results suggest the current 4.0
> value is a bit too low, and something like ~20 would be better even on
> SSDs. This is not the first time a higher default has been suggested -
> see this 2008 post [3]. Of course, that post predates SSDs; it's about
> the evolution of hard disks and our code.
>
> However, it also says this:
>
> Yeah, it seems like raising random_page_cost is not something we
> ever recommend in practice. I suspect what we'd really need here to
> make any progress is a more detailed cost model, not just fooling
> with the parameters of the existing one.
>
> I don't quite follow the reasoning. If increasing the value would
> require making the cost model more detailed, why wouldn't the same
> apply to lowering it? I don't see a reason for the asymmetry.
>
> Also, I intentionally used a table with "perfectly random" data, because
> that's about the simplest thing to estimate, and it indeed makes all the
> estimates almost perfect (including the internal ones in cost_index). If
> we can't cost such simple cases correctly, what's the point of costing?
>
> From a robustness point of view, wouldn't it be better to actually err
> on the side of using a higher random_page_cost value? That'd mean we
> flip to "more-sequential" scans sooner, with much "flatter" behavior.
> That doesn't need to be a seqscan (which is about as flat as it gets),
> but e.g. a bitmap scan - which probably silently "fixes" many cases
> where the index scan gets costed too low.
>
> It also says this:
>
> And the value of 4 seems to work well in practice.
>
> I wonder how we know that? Most users don't experiment with different
> values very much. They just run with the default, or maybe even lower
> it, based on some recommendation. But they don't run the same query with
> different values, so they can't spot differences unless they hit a
> particularly bad plan.
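>
> Checking would be easy enough - run EXPLAIN on the same query under a
> couple of different settings in one session, e.g. (with a hypothetical
> query):
>
>    EXPLAIN SELECT * FROM test_table WHERE id BETWEEN 1 AND 100000;
>    SET random_page_cost = 20;
>    EXPLAIN SELECT * FROM test_table WHERE id BETWEEN 1 AND 100000;
>
> but few people bother unless a plan is already visibly bad.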
>
> Of course, it's also true that most workloads tend to access well-cached
> data, which makes errors much cheaper. Or maybe just queries with the
> "problematic selectivities" are not that common. Still, even if it
> doesn't change the scan choice, it seems important to keep the cost
> somewhat closer to reality because of the plan nodes above ...
>
>
> It seems to me the current default is a bit too low, but changing a GUC
> this important is not trivial. So what should we do about it?
>
>
> regards
>
>
>
> [1]
> https://www.postgresql.org/message-id/flat/14601(dot)949786166(at)sss(dot)pgh(dot)pa(dot)us
>
> [2]
> https://www.postgresql.org/message-id/3866858.1728961439%40sss.pgh.pa.us
>
> [3] https://www.postgresql.org/message-id/23625.1223642230%40sss.pgh.pa.us
>
> --
> Tomas Vondra
>