From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 15:14:13
Message-ID: u52uxp6dcug3xjshe7qricmzs7cwjouuci5qlxesmqaaahng62@web4zmsl3rzw
Lists: pgsql-hackers
Hi,
On 2025-10-06 02:59:16 +0200, Tomas Vondra wrote:
> So I decided to try doing this on a couple different devices, and see
> what random_page_cost values that gives me. Attached is a script doing
> such benchmark:
>
> (1) initializes a new cluster, with a couple parameters adjusted
>
> (2) creates a random table (with uniform distribution)
>
> (3) runs a sequential scan
>
> SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
> (4) runs an index scan
>
> SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;
Why compare an unordered scan with an ordered one? ISTM that if you want to
actually compare the cost of two different approaches to the same query, you'd
not want to change what the query does. Yes, you say that CPU time never gets
above 50%, but 50% isn't nothing.
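One way to keep the query itself fixed and only vary the access path (a rough
sketch, assuming the test_table from your script and an arbitrary id range)
would be to run the identical statement twice and just toggle the planner GUCs:

  -- force the sequential path for the statement
  SET enable_indexscan = off; SET enable_bitmapscan = off;
  EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table WHERE id BETWEEN 1 AND 1000000;

  -- same statement again, this time forcing the index path
  RESET enable_indexscan; RESET enable_bitmapscan;
  SET enable_seqscan = off;
  EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table WHERE id BETWEEN 1 AND 1000000;

That way both timings include the same tuple processing, and the difference is
(mostly) down to the IO pattern.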
It also seems that, due to the physical ordering of the table (the ORDER BY
random() during table creation), you're going to see a vastly different number
of page accesses. While that's obviously something worth taking into account
for planning purposes, I don't think random_page_cost itself is the right
place to account for it.
I think doing this kind of measurement via normal SQL query processing is
almost always going to be subject to too many other influences. I'd measure
using fio or such instead. It'd be interesting to see fio numbers for your
disks...
fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
vs --rw randread
gives me 51k/11k IOPS for sequential/random reads on one SSD and 92k/8.7k on another.
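For what it's worth, naively reading those IOPS numbers as a per-page cost
ratio (which ignores caching, prefetching and CPU overhead entirely) would
suggest roughly 51/11 ~= 4.6 for the first SSD and 92/8.7 ~= 10.6 for the
second - i.e. not obviously below the current default of 4.0.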
> It obviously contradicts the advice to set the value closer to 1.0. But
> why is that? SSDs are certainly better with random I/O, even if the I/O
> is not concurrent and the SSD is not fully utilized. So the 4.0 seems
> off, the value should be higher than what we got for SSDs ...
I'd guess that the *vast* majority of PG workloads these days run on networked
block storage. For those, the actual latency at the storage level is typically
a rather small fraction of the overall IO latency, which is instead dominated
by the network and other related costs (like the indirection of figuring out
which storage system to go to and crossing VM/host boundaries). Because the
majority of the IO latency is determined by the network rather than the
storage device, the difference between random and non-random IO plays less of
a role.
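To illustrate with made-up numbers (not measurements): if the device itself
takes 0.02 ms for a sequential 8kB read and 0.1 ms for a random one, that's a
5x difference. Add 0.5 ms of network/virtualization overhead to both and you
get 0.52 ms vs 0.6 ms - a ratio of only about 1.15.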
> From a robustness point of view, wouldn't it be better to actually err
> on the side of using a higher random_page_cost value? That'd mean we
> flip to "more-sequential" scans sooner, with much "flatter" behavior.
> That doesn't need to be a seqscan (which is about as flat as it gets),
> but e.g. a bitmap scan - which probably silently "fixes" many cases
> where the index scan gets costed too low.
I think it's often the exact opposite - folks use a lower random_page_cost to
*prevent* the planner from switching to sequential (or bitmap heap) scans. In
many real-world queries our selectivity estimates aren't great, and the
performance penalty of switching from an index scan to a sequential scan can
be really severe. As you note, this is heavily exacerbated by hot data often
being cached while cold data is not. Obviously the seqscan will process the
cold data too.
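A quick way to see where that flip happens (again a sketch, with the
hypothetical test_table and a made-up predicate) is to compare plans under
different settings:

  SET random_page_cost = 4.0;
  EXPLAIN SELECT * FROM test_table WHERE id < 500000;  -- may flip to seq/bitmap scan

  SET random_page_cost = 1.1;
  EXPLAIN SELECT * FROM test_table WHERE id < 500000;  -- more likely to stay on the index

If the predicate is actually much more selective than estimated, the lower
setting avoids a full-table scan that would have been far slower on
mostly-cold data.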
Greetings,
Andres Freund