From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-07 12:08:27
Message-ID: 3ec322cd-e368-4d55-b9ba-da252a390867@vondra.me
Lists: pgsql-hackers
On 10/7/25 01:56, Andres Freund wrote:
> Hi,
>
> On 2025-10-07 00:52:26 +0200, Tomas Vondra wrote:
>> On 10/6/25 23:27, Andres Freund wrote:
>>> On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
>>>> On 10/6/25 17:14, Andres Freund wrote:
>>>>> It also seems that due to the ordering inside the table (the order by
>>>>> random()) during the table creation, you're going to see vastly different
>>>>> number of page accesses. While that's obviously something worth taking into
>>>>> account for planning purposes, I don't think it'd be properly done by the
>>>>> random_page_cost itself.
>>>>>
>>>>
>>>> I don't understand your argument. Yes, the index scan does way more page
>>>> accesses. Essentially, each index tuple reads a new page - and without a
>>>> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
>>>> page, which means we have to read it ~21x. In other words, it reads as
>>>> many pages as there are tuples.
>>>>
>>>> This is why my formulas divide the seqscan timing by number of pages,
>>>> but indexscan is divided by number of tuples. And AFAIK the costing does
>>>> exactly that too. So the random_page_cost doesn't need to do anything
>>>> special, we estimate the number of page reads. (And at least in the case
>>>> of random table it's pretty accurate.)
>>>
>>> The thing is that your queries have vastly different timings *regardless* of
>>> IO. Here's an example, fully cached:
>>>
>>> postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
>>> ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
>>> │ QUERY PLAN │
>>> ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
>>> │ Limit (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1) │
>>> │ Buffers: shared hit=238096 │
>>> │ -> Seq Scan on test_table (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1) │
>>> │ Buffers: shared hit=238096 │
>>> │ Planning Time: 0.087 ms │
>>> │ Execution Time: 392.082 ms │
>>> └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
>>> (6 rows)
>>>
>>> postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
>>> ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
>>> │ QUERY PLAN │
>>> ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
>>> │ Limit (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1) │
>>> │ Buffers: shared hit=5013643 │
>>> │ -> Index Scan using test_table_id_idx on test_table (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1) │
>>> │ Index Searches: 1 │
>>> │ Buffers: shared hit=5013643 │
>>> │ Planning Time: 0.131 ms │
>>> │ Execution Time: 3025.930 ms │
>>> └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
>>> (7 rows)
>>>
>>> You have a ~8x difference in time, even without a single IO. A lot of that is
>>> probably just the 20x difference in buffer accesses, but also the index path
>>> just being a lot less predictable for the CPU.
>>>
>>
>> Right. And that's why the timing is divided by *pages* for sequential
>> scans, and by *rows* for index scan. Which for the timings you show ends
>> up doing this:
>>
>> seqscan = 392 ms / 238096 pages = 0.0016 ms/page
>> indexscan = 3025 ms / 5M rows = 0.0006 ms/row
>>
>> So the per-page overhead with index scans is 1/3 of seqscans. AFAICS
>> this directly contradicts your argument that random_page_cost ends up
>> too high due to indexscans paying more per page.
>
> I don't think that's contradicting it. The seqscan is doing ~21 tuples per
> page, the indexscan ~1. There is more work happening for each sequentially
> scanned page. And of course the number of pages pinned isn't the only factor.
>
> Just look at a profile of the index query in the cached case:
>
> - 99.98% 0.00% postgres postgres [.] standard_ExecutorRun
> standard_ExecutorRun
> - ExecProcNodeInstr
> - 99.85% ExecLimit
> - 99.26% ExecProcNodeInstr
> - 98.06% ExecScan
> - 92.13% IndexNext
> - 91.07% index_getnext_slot
> - 85.55% heapam_index_fetch_tuple
> - 47.69% ReleaseAndReadBuffer
> + 42.17% StartReadBuffer
> 1.78% hash_bytes
> 1.63% UnpinBufferNoOwner
> 15.94% heap_hot_search_buffer
> 13.22% heap_page_prune_opt
> + 3.70% ExecStoreBufferHeapTuple
> 2.15% LWLockAcquire
> 1.33% LWLockRelease
> - 3.96% index_getnext_tid
> + 3.44% btgettuple
> + 4.50% ExecInterpExpr
>
> and compare that with the seqscan:
>
> - 99.93% 0.00% postgres postgres [.] standard_ExecutorRun
> standard_ExecutorRun
> - ExecProcNodeInstr
> - 98.64% ExecLimit
> - 93.10% ExecProcNodeInstr
> - 81.23% ExecSeqScan
> - 70.30% heap_getnextslot
> - 56.28% heapgettup_pagemode
> - 30.99% read_stream_next_buffer
> + 29.16% StartReadBuffer
> - 10.99% heap_prepare_pagescan
> 7.13% heap_page_prune_opt
> 1.08% LWLockAcquire
> 1.42% LWLockRelease
> 1.13% ReleaseBuffer
> - 8.28% ExecStoreBufferHeapTuple
> 2.00% UnpinBufferNoOwner
> 2.26% MemoryContextReset
> 1.31% heapgettup_pagemode
> 1.27% ExecStoreBufferHeapTuple
> 3.16% InstrStopNode
> 2.82% InstrStartNode
> 1.37% heap_getnextslot
> 1.00% InstrStartNode
> 1.30% ExecProcNodeInstr
>
>
That's exactly my point. The index scan is doing *less* work per page.
It's accessing more pages (21x in total), but the per-page cost is
significantly lower.
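Making that arithmetic explicit (a quick Python sketch, using only the
numbers from the plans quoted above):

```python
# Per-access overhead from the fully cached EXPLAIN runs quoted above:
# the seqscan time is divided by pages touched, the index scan time by
# rows fetched (each row fetch pins roughly one heap page here).

def per_access_ms(total_ms, accesses):
    """Average cost of a single page/row access, in milliseconds."""
    return total_ms / accesses

seqscan = per_access_ms(392, 238_096)       # 238096 shared buffer hits
indexscan = per_access_ms(3025, 5_000_000)  # 5M rows, ~1 heap page each

print(f"seqscan:   {seqscan:.4f} ms/page")   # ~0.0016
print(f"indexscan: {indexscan:.4f} ms/row")  # ~0.0006
```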
>
>> I'd bet these costs are pretty negligible compared to the cost of
>> actual I/O. So I don't see how would this explain the timing difference
>> with cold data.
>
> Of course there's an increased cost of random IO, I never doubted that! I'm
> just saying that your method of measurement seems to over-estimate it.
>
If the per-page overhead is lower for random I/O (0.0006 vs. 0.0016),
why would it lead to over-estimation? AFAIK it leads exactly to the
opposite thing.
Let's try to eliminate this overhead from the calculations. I measured
sequential and index scan on a 10M table, with everything cached (in
shared buffers). I got 260ms for sequential, 5,200ms for index. The 50M
table would need ~13,000ms and 260,000ms. Let's assume all of this is
the "unfair" overhead, and let's subtract that from the timings, leaving
just the "clean" I/O costs. That gives us this (for the NVMe RAID0):
             seqscan (s)   index scan (s)   random_page_cost
-----------------------------------------------------------------
NVMe/RAID0            11            25202              103.4

Which is about twice what it was before:

             seqscan (s)   index scan (s)   random_page_cost
-----------------------------------------------------------------
NVMe/RAID0            24            25462               49.3
So, what am I missing?
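For clarity, the subtraction can be written out like this (Python; the
cached timings are the 10M-row measurements scaled up, as described above):

```python
# Subtract the fully cached ("CPU-only") runtime from the cold-cache runtime
# to isolate the I/O component, per the reasoning above. Inputs are the
# NVMe/RAID0 numbers quoted in this message.

cold_seq_s, cold_idx_s = 24.0, 25462.0    # cold cache: seqscan / index scan
cached_seq_s, cached_idx_s = 13.0, 260.0  # fully cached, scaled to 50M rows

io_seq_s = cold_seq_s - cached_seq_s      # 11 s of "clean" sequential I/O
io_idx_s = cold_idx_s - cached_idx_s      # 25202 s of "clean" random I/O

# The implied random_page_cost is proportional to (index time / seqscan time);
# page and row counts cancel when comparing before vs. after the subtraction:
scale = (io_idx_s / io_seq_s) / (cold_idx_s / cold_seq_s)
print(f"implied rpc grows {scale:.1f}x")  # ~2.2x, i.e. roughly doubles
```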
>
>>> I don't think random/seq_page_cost should account for difference in processing
>>> costs that aren't actually related to it being a sequential or a random IO.
>>>
>>
>> Perhaps. If it's not directly related to randomness of I/O, then maybe
>> random_page_cost is not the right cost parameter.
>>
>> But it needs to be included *somewhere*, right? And it's per-page cost,
>> and per the experiments there's very clear difference between sequential
>> and random access. So why wouldn't random_page_cost be a good fit?
>
> We are already accounting for the increased CPU cost for index scans etc to
> some degree, via cpu_* costs. If we under-estimate the CPU cost of index scans
> we should fix that, regardless of random_page_cost, as a well correlated index
> scan *still* is a lot more expensive than a sequential scan.
>
Yeah, that makes sense. If this really is a CPU cost, then it should be
reflected in some cpu_ parameter. But which one? I don't think we have a
per-page CPU cost?
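For reference, seqscan costing today is per-page I/O plus per-tuple CPU
(simplified from cost_seqscan); here's a sketch of where a per-page CPU
knob could slot in. The cpu_page_cost parameter is made up for
illustration, not an existing GUC:

```python
# Simplified shape of cost_seqscan (costsize.c): per-page I/O cost plus
# per-tuple CPU cost. There is currently no per-page CPU term; cpu_page_cost
# below is a HYPOTHETICAL parameter, shown only to illustrate where it'd go.

SEQ_PAGE_COST = 1.0    # default seq_page_cost
CPU_TUPLE_COST = 0.01  # default cpu_tuple_cost

def seqscan_cost(pages, tuples, cpu_page_cost=0.0):
    io_cost = SEQ_PAGE_COST * pages
    cpu_cost = CPU_TUPLE_COST * tuples + cpu_page_cost * pages
    return io_cost + cpu_cost

# With cpu_page_cost=0 this reproduces the total cost of the seqscan plan
# quoted earlier (cost=0.00..288096.16 for 238096 pages, 5000016 rows):
print(seqscan_cost(pages=238_096, tuples=5_000_016))
```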
>
>>>>> I think doing this kind of measurement via normal SQL query processing is
>>>>> almost always going to have too much other influences. I'd measure using fio
>>>>> or such instead. It'd be interesting to see fio numbers for your disks...
>>>>>
>>>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>>>> vs --rw randread
>>>>>
>>>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>>>
>>>>
>>>> I can give it a try. But do we really want to strip "our" overhead with
>>>> reading data?
I got this on the two RAID devices (NVMe and SATA):
NVMe: 83.5k / 15.8k
SATA: 28.6k / 8.5k
So the same ballpark / ratio as your test. Not surprising, really.
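The ratios work out like this (trivial sketch; with identical 8kB requests
the IOPS ratio is the inverse of the per-page latency ratio):

```python
# seq/random read IOPS (thousands) from the fio runs above, and the
# resulting sequential:random latency ratio per device.

results_k_iops = {
    "NVMe": (83.5, 15.8),
    "SATA": (28.6, 8.5),
}

for dev, (seq_iops, rand_iops) in results_k_iops.items():
    print(f"{dev}: random read is {seq_iops / rand_iops:.1f}x slower per page")
# NVMe comes out around 5.3x, SATA around 3.4x -- the same ballpark as the
# ~4x Andres measured via pg_prewarm.
```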
>>>
>>> I think the relevant differences in efficiency here don't come from something
>>> inherent in postgres, it's that you're comparing completely different code
>>> paths that have different IO independent costs. You could totally measure it
>>> in postgres, just not easily via "normal" SQL queries on tables/indexes.
>>>
>>
>> If the two paths for seq/index access have so vastly different
>> overheads, why shouldn't that be reflected in the seq_page_cost and
>> random_page_cost?
>
> No, because a) random_page_cost is applied to things other than index access
> b) some index scans are not dominated by random accesses.
>
>
>
>>> You can do a decent approximation of this via postgres too, by utilizing
>>> pg_prewarm and/or pageinspect. E.g. something like
>>>
>>> random IO:
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
>>> and for sequential IO
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
>>>
>>> Time for sequential IO: 8066.720 ms
>>> Time for random IO: 31623.122 ms
>>>
>>> That's, almost eerily, close to 4x :)
>>>
>>>
>>> FWIW, when cached, the difference in runtime between the two variants is
>>> negligible.
>>>
>>
>> The problem is the sequential query forces the I/O blocks to be 8kB
>> each, instead of issuing larger requests (like a regular seqscan). On my
>> laptops that makes the seqscan 10x slower, so for regular seqscan the
>> ratio is way higher than 4x.
>>
>> Maybe this is the kind of overhead you claim should not be considered when
>> setting random_page_cost, but I'm not convinced of that.
>
> A correlated index scan today will not do IO combining, despite being
> accounted as seq_page_cost. So just doing individual 8kB IOs actually seems to
> be the appropriate comparison. Even with table fetches in index scans doing
> IO combining as part of your work, the reads of the index data itself won't be
> combined. And I'm sure other things won't be either.
>
But that's the point. If the sequential reads do I/O combining and index
scans don't (and I don't think that will change anytime soon), then that
makes sequential I/O much more efficient / cheaper, and we'd better
reflect that in the costs somehow. Maybe increasing the random_page_cost
is not the right/best solution? That's possible.
>
> I'd be on-board with trying to improve the cost accounting to take IO
> combining into account in some form. I don't quite know how, off-the-cuff: The
> advantage of combining seems to quickly drop off after a few pages, but by how
> much and where exactly seems to very heavily depend on the specific
> hardware. Just dividing the number of sequential accesses by io_combine_limit
> seems like it'd be over-estimating the effect substantially.
>
>
> I do wonder if we ought to split off the CPU cost associated with both
> sequential and random_page_cost into a cpu_page_cost or such. Right now we
> IIRC largely disregard the cost of accessing some pages repeatedly, just
> because we estimate that to not incur IO costs. But of course a plan that has
> fewer page accesses while otherwise doing the same amount of work will be
> faster, the profiles further up clearly show that we spend a fair amount of
> time in buffer handling even when fully cached.
>
Seems reasonable.
regards
--
Tomas Vondra