From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 18:20:15
Message-ID: fb9fb164-f2c2-4526-b174-e6dc49d53713@vondra.me
Lists: pgsql-hackers
On 10/6/25 17:14, Andres Freund wrote:
> Hi,
>
> On 2025-10-06 02:59:16 +0200, Tomas Vondra wrote:
>> So I decided to try doing this on a couple different devices, and see
>> what random_page_cost values that gives me. Attached is a script doing
>> such benchmark:
>>
>> (1) initializes a new cluster, with a couple parameters adjusted
>>
>> (2) creates a random table (with uniform distribution)
>>
>> (3) runs a sequential scan
>>
>> SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>>
>> (4) runs an index scan
>>
>> SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;
>
> Why compare an unordered with an ordered scan? ISTM that if you want to
> actually compare the cost of two different approaches to the same query, you'd
> not want to actually change what the query does? Yes, you say that CPU time
> never gets above 50%, but 50% isn't nothing.
>
Because I need the second query to do an index scan, and without the
ORDER BY the planner refuses to do that (even with seqscans disabled),
because we don't even build the index path. The ordering doesn't add any
overhead, it merely forces reading the data through the index.

OTOH I don't want the ordering for the first query, because that'd add
an explicit Sort.
> It also seems that due to the ordering inside the table (the order by
> random()) during the table creation, you're going to see a vastly different
> number of page accesses. While that's obviously something worth taking into
> account for planning purposes, I don't think it'd be properly done by the
> random_page_cost itself.
>
I don't understand your argument. Yes, the index scan does way more page
accesses. Essentially, each index tuple fetch reads a new heap page - and
without a cache hit that's an I/O. With fillfactor=20 we have ~21 tuples
per heap page, which means each heap page gets read ~21 times over the
scan. In other words, the index scan reads as many pages as there are
tuples.
This is why my formulas divide the seqscan timing by the number of pages,
but the indexscan timing by the number of tuples. And AFAIK the costing
does exactly that too. So random_page_cost doesn't need to do anything
special, we estimate the number of page reads. (And at least in the case
of a random table that estimate is pretty accurate.)
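
Roughly speaking, the derived value is just this ratio (the timings and
counts below are made up, the real numbers come from the benchmark
script):

  SELECT (indexscan_ms / n_tuples) / (seqscan_ms / n_pages)
             AS derived_random_page_cost
    FROM (VALUES (600000.0,    -- index scan runtime [ms]  (made up)
                  10000000.0,  -- tuples read               (made up)
                  5000.0,      -- seq scan runtime [ms]     (made up)
                  500000.0))   -- heap pages read           (made up)
         AS t(indexscan_ms, n_tuples, seqscan_ms, n_pages);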
>
> I think doing this kind of measurement via normal SQL query processing is
> almost always going to have too many other influences. I'd measure using fio
> or such instead. It'd be interesting to see fio numbers for your disks...
>
> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
> vs --rw randread
>
> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>
I can give it a try. But do we really want to strip out "our" overhead of
reading the data?
>
>> It obviously contradicts the advice to set the value closer to 1.0. But
>> why is that? SSDs are certainly better with random I/O, even if the I/O
>> is not concurrent and the SSD is not fully utilized. So the 4.0 seems
>> off, the value should be higher than what we got for SSDs ...
>
> I'd guess that the *vast* majority of PG workloads these days run on networked
> block storage. For those typically the actual latency at the storage level is
> a rather small fraction of the overall IO latency, which is instead dominated
> by network and other related cost (like the indirection to which storage
> system to go to and crossing VM/host boundaries). Because the majority of the
> IO latency is not affected by the storage latency, but by network latency, the
> random IO/non-random IO difference will play less of a role.
>
True. I haven't tested that.
>
>
>> From a robustness point of view, wouldn't it be better to actually err
>> on the side of using a higher random_page_cost value? That'd mean we
>> flip to "more-sequential" scans sooner, with much "flatter" behavior.
>> That doesn't need to be a seqscan (which is about as flat as it gets),
>> but e.g. a bitmap scan - which probably silently "fixes" many cases
>> where the index scan gets costed too low.
>
> I think it's often the exact opposite - folks use a lower random page cost to
> *prevent* the planner from going to sequential (or bitmap heap) scans. In many
> real-world queries our selectivity estimates aren't great and the performance
> penalties of switching from an index scan to a sequential scan are really
> severe. As you note, this is heavily exacerbated by the hot data often being
> cached, but cold data not. Obviously the seqscan will process the cold data
> too.
>
I understand your point, but I'm not convinced random_page_cost is the
right tool to fix incorrect estimates.
regards
--
Tomas Vondra