Re: Should we update the random_page_cost default value?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 21:27:12
Message-ID: mnxs6dzq5wn7glrf4e7jlz4gsnhbgaj63hkwavcfk36mdv5jyw@azhbwcmmibwn
Lists: pgsql-hackers

Hi,

On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
> On 10/6/25 17:14, Andres Freund wrote:
> > It also seems that due to the ordering inside the table (the order by
> > random()) during the table creation, you're going to see vastly different
> > numbers of page accesses. While that's obviously something worth taking into
> > account for planning purposes, I don't think it'd be properly done by the
> > random_page_cost itself.
> >
>
> I don't understand your argument. Yes, the index scan does way more page
> accesses. Essentially, each index tuple reads a new page - and without a
> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
> page, which means each page gets read ~21 times. In other words, the scan
> reads as many pages as there are tuples.
>
> This is why my formulas divide the seqscan timing by the number of pages,
> but the indexscan timing by the number of tuples. And AFAIK the costing
> does exactly that too. So random_page_cost doesn't need to do anything
> special, we estimate the number of page reads. (And at least in the case
> of a random table it's pretty accurate.)
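
(Restating that approach: the implied per-page costs are

  seq cost per page    ~= seqscan time   / number of heap pages
  random cost per page ~= indexscan time / number of tuples

since the index scan does roughly one heap page read per tuple.)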

The thing is that your queries have vastly different timings *regardless* of
IO. Here's an example, fully cached:

postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1) │
│ Buffers: shared hit=238096 │
│ -> Seq Scan on test_table (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1) │
│ Buffers: shared hit=238096 │
│ Planning Time: 0.087 ms │
│ Execution Time: 392.082 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1) │
│ Buffers: shared hit=5013643 │
│ -> Index Scan using test_table_id_idx on test_table (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1) │
│ Index Searches: 1 │
│ Buffers: shared hit=5013643 │
│ Planning Time: 0.131 ms │
│ Execution Time: 3025.930 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(7 rows)

You have a ~8x difference in time, even without a single IO. A lot of that is
probably just the ~21x difference in buffer accesses, but also the index path
just being a lot less predictable for the CPU.
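
In per-row terms: the seqscan comes out to ~392 ms / 5M rows ~= 78 ns per
row, the index scan to ~3026 ms / 5M rows ~= 605 ns per row - and all of
that is CPU, not IO.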

I don't think random/seq_page_cost should account for differences in
processing costs that aren't actually related to the IO being sequential or
random.

> > I think doing this kind of measurement via normal SQL query processing is
> > almost always going to have too much other influences. I'd measure using fio
> > or such instead. It'd be interesting to see fio numbers for your disks...
> >
> > fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
> > vs --rw randread
> >
> > gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
> >
>
> I can give it a try. But do we really want to strip "our" overhead with
> reading data?

I think the relevant differences in efficiency here don't come from anything
inherent in postgres; it's that you're comparing completely different code
paths that have different IO-independent costs. You could totally measure it
in postgres, just not easily via "normal" SQL queries on tables/indexes.

Maybe we should have a function that measures some basic IO "factors" for each
tablespace and for pg_wal.

You can do a decent approximation of this via postgres too, by utilizing
pg_prewarm and/or pageinspect. E.g. something like

random IO:

SELECT count(*)
FROM (SELECT pg_prewarm('test_table', first_block => blkno, last_block => blkno)
      FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno)
      ORDER BY random(), blkno);

and for sequential IO:

SELECT count(*)
FROM (SELECT pg_prewarm('test_table', first_block => blkno, last_block => blkno)
      FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno)
      ORDER BY blkno, random());
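
(The trailing sort key in each query is presumably just there to keep the
two statements symmetrical. And since pg_prewarm() is volatile, its
evaluation should get postponed until after the sort, so the blocks really
are fetched in the sorted order.)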

Time for sequential IO: 8066.720 ms
Time for random IO: 31623.122 ms

That's, almost eerily, close to 4x (31623 / 8067 ≈ 3.9) :)

FWIW, when cached, the difference in runtime between the two variants is
negligible.

I guess it'd actually be better to use mode=>'read' (to keep the constant
cost of the memcpy into the buffer pool from diluting the impact of
sequential vs random IO), in which case the difference in IO time increases
to a more substantial 9x for the uncached case.
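
I.e. the same queries as above with just the mode switched, e.g. for the
random case:

SELECT count(*)
FROM (SELECT pg_prewarm('test_table', mode => 'read',
                        first_block => blkno, last_block => blkno)
      FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno)
      ORDER BY random(), blkno);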

> >> From a robustness point of view, wouldn't it be better to actually err
> >> on the side of using a higher random_page_cost value? That'd mean we
> >> flip to "more-sequential" scans sooner, with much "flatter" behavior.
> >> That doesn't need to be a seqscan (which is about as flat as it gets),
> >> but e.g. a bitmap scan - which probably silently "fixes" many cases
> >> where the index scan gets costed too low.
> >
> > I think it's often the exact opposite - folks use a lower random page cost to
> > *prevent* the planner from going to sequential (or bitmap heap) scans. In many
> > real-world queries our selectivity estimates aren't great and the performance
> > penalties of switching from an index scan to a sequential scan are really
> > severe. As you note, this is heavily exacerbated by the hot data often being
> > cached, but cold data not. Obviously the seqscan will process the cold data
> > too.
> >
>
> I understand your point, but I'm not convinced random_page_cost is the
> right tool to fix incorrect estimates.

I don't think it is - but I do think that if we just increased
random_page_cost substantially, without other accompanying changes, we'd
cause a lot of problems for folks, due to the above issue.

Greetings,

Andres Freund
