Re: Should we update the random_page_cost default value?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 16:08:56
Message-ID: c7dryab6hx5bu2qob3vpledzfgcywibxyxlwlfhljpugzl7pen@ahxgegh4qvq2
Lists: pgsql-hackers

Hi,

On 2025-10-06 01:53:05 -0400, Tom Lane wrote:
> Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> writes:
> > On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:
> >> But if what
> >> we're trying to model is net resource demands, with an eye to
> >> minimizing the total system load not execution time of any one query,
> >> maybe we can continue to work with something close to what we've
> >> traditionally done.
>
> > Did anybody propose that?
>
> I just did ;-). If we don't adopt a mindset along that line,
> then AIO is going to require some *radical* changes in the
> planner's I/O cost models.

FWIW, I think some vaguely-AIO-related cost concepts ought to have been taken
into account long ago, but weren't. I'm not trying to say that we don't need
to do work, but that asynchronicity-related issues are bigger than explicit
asynchronous IO.

E.g., despite there often being a >10x performance difference between a
forward and a backward index scan when the index and table order are well
correlated, we didn't cost them any differently. The reason is that the OS
will do readahead for forward scans, but not for backward scans.
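As a back-of-envelope illustration (my numbers, not PostgreSQL code; the readahead window size is an assumption), the asymmetry falls out directly from paying the latency once per readahead window versus once per page:

```python
# Toy model of forward vs backward index scans over well-correlated data
# on storage with 1 ms per-IO latency. READAHEAD_PAGES is an assumed OS
# readahead window: the forward scan pays the latency roughly once per
# window, the backward scan once per page.

LATENCY_MS = 1.0
PAGES = 2145            # heap pages read in the example plans below
READAHEAD_PAGES = 32    # assumed readahead window, in pages

forward_ms = PAGES / READAHEAD_PAGES * LATENCY_MS
backward_ms = PAGES * LATENCY_MS

print(f"forward ~{forward_ms:.0f} ms, backward ~{backward_ms:.0f} ms")
```

Even with this crude model the gap is more than an order of magnitude, which matches the plans below.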

Here's an example EXPLAIN ANALYZE on a disk with an artificial 1ms latency
(to simulate networked storage):

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val ASC LIMIT 1 OFFSET 100000;'

QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=49.820..49.821 rows=1.00 loops=1)
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
-> Index Scan using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=2.971..47.261 rows=100001.00 loops=>
Index Searches: 1
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
Planning:
Buffers: shared hit=113 read=23
I/O Timings: shared read=10.618
Planning Time: 10.981 ms
Execution Time: 49.838 ms
(12 rows)

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val DESC LIMIT 1 OFFSET 100000;'
QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=2138.047..2138.049 rows=1.00 loops=1)
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
-> Index Scan Backward using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=4.912..2135.519 rows=10000>
Index Searches: 1
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
Planning:
Buffers: shared hit=112 read=23
I/O Timings: shared read=10.446
Planning Time: 10.816 ms
Execution Time: 2138.067 ms
(12 rows)

36ms vs 2123ms in IO time, with the same cost.

Of course these queries *do* something different, but we often choose ordered
index scans as a way of implementing a query (e.g. as part of fast start
plans), where there are other alternatives of implementing the same query.

Another example of bad costing in this area is that we do not take
effective_io_concurrency into account when planning bitmap heap scans, despite
it often making a *massive* difference in whether a bitmap heap scan is a
good choice or a bad one. E.g., on the system with the simulated 1ms IO
latency, the difference between 1 and 64 is vast.

eic=1:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=184218.775..184218.776 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=79.811..184202.181 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=37.287..37.288 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=4.972
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.571
Planning Time: 12.953 ms
Execution Time: 184218.986 ms

eic=64:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=2962.965..2962.966 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=82.871..2947.892 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=38.526..38.526 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=5.021
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.485
Planning Time: 12.890 ms
Execution Time: 2963.154 ms
(18 rows)

A ~60x query execution difference that's not at all represented in the cost
model. The former makes a bitmap heap scan a terrible choice, the latter a
good one...
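The shape of that difference is easy to model: when the IO is prefetchable, the per-IO latencies overlap, so total IO wait scales roughly with the number of reads divided by the IO depth. A hedged sketch (the queueing is idealized; the read count comes from the plans above):

```python
# Idealized model: N prefetchable random reads with per-IO latency L and
# an IO depth of e (effective_io_concurrency) overlap, so total IO wait
# is roughly N * L / e. Real devices and queues are messier than this.

LATENCY_MS = 1.0
READS = 194846          # shared reads from the bitmap heap scans above

def io_wait_ms(reads, eic, latency_ms=LATENCY_MS):
    return reads * latency_ms / eic

print(f"eic=1:  ~{io_wait_ms(READS, 1) / 1000:.0f} s")
print(f"eic=64: ~{io_wait_ms(READS, 64) / 1000:.1f} s")
```

That predicts ~195s vs ~3s, in the same ballpark as the measured 184s vs 3s, yet the planner assigns both plans the same cost.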

> > I was under the impression that PostgreSQL's cost model tries to
> > estimate and optimize execution time, not resource consumption.
>
> Yup, that's our traditional view of it. But I wonder how we
> will make such estimates in a parallel-I/O world, especially
> if we don't try to account for concurrent query activity.
> (Which is a place I don't want to go, because it would render
> planning results utterly irreproducible.)

Another complicated aspect around this is that we don't take caching into
account in any way that reflects the real world. In common workloads the hotly
accessed data is all in shared_buffers, but the cold portion of the data is
not. In such scenarios, switching e.g. from an index scan to a sequential scan
will be much worse than in a scenario where all the data is cold, since the
sequential scan will likely eat up a good portion of the IO bandwidth and
possibly pollute the buffer pool.
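One way to see why the hot/cold mix matters for costing: if the planner blended page costs by an estimated cache hit ratio, the effective cost of the same page access would diverge sharply between hot and cold data. A purely hypothetical sketch (none of these knobs exist in PostgreSQL; the constants are invented):

```python
# Hypothetical cache-aware page cost (not a PostgreSQL parameter):
# blend the cost of a shared_buffers hit with the cost of a cold read.

def effective_page_cost(hit_ratio, hit_cost=0.01, miss_cost=4.0):
    assert 0.0 <= hit_ratio <= 1.0
    return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost

hot = effective_page_cost(0.99)   # hot index pages: almost always cached
cold = effective_page_cost(0.05)  # cold heap pages: almost never cached
print(f"hot ~{hot:.4f}, cold ~{cold:.4f}")
```

A ~75x spread in effective per-page cost from the hit ratio alone, which is exactly the kind of signal the current model cannot express.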

At the same time, leaving the difficulty of estimating that aside, making
plans depend on the current state of the buffer pool has some fairly obvious,
and massive, downsides: unstable plans, and never actually reaching the
"high performance" state after a restart, due to choosing plans that only
work well for an empty buffer pool.

I do not have the *slightest* idea for how to improve the situation around
this, even though I think it's fairly important.

> > But perhaps I misunderstood, or perhaps I am just too conservative.
>
> I'm normally pretty conservative also about changing planner
> behavior. But in this context I think we need to be wary of
> thinking too small. The fact that people keep coming out with
> different ideas of what random_page_cost needs to be suggests
> that there's something fundamentally wrong with the concept.

I think one of the big issues with random_page_cost is that it combines two
largely independent things:

1) the increased cost of doing random IO instead of sequential IO
2) that random IOs very often are synchronous and hard to predict

But we have had support for doing readahead for some random IO for a long time
(the posix_fadvise() calls within bitmap heap scans), just without taking that
into account from a costing POV.
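For reference, the prefetch pattern those calls implement can be sketched in a few lines (Python's os.posix_fadvise wraps the same syscall; the file, block size constant, and block numbers here are made up for illustration):

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size

# Sketch of bitmap-heap-scan-style prefetching: the block numbers are known
# ahead of time from the bitmap, so they can all be hinted to the kernel
# before the actual (synchronous) reads are issued. POSIX-only; the advice
# is a no-op on some filesystems.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * BLCKSZ * 8)   # stand-in for a tiny 8-block relation
    f.flush()
    blocks = [1, 4, 6]            # hypothetical bitmap, in block order
    for blkno in blocks:
        os.posix_fadvise(f.fileno(), blkno * BLCKSZ, BLCKSZ,
                         os.POSIX_FADV_WILLNEED)
    data = []
    for blkno in blocks:          # the reads themselves are still synchronous
        os.lseek(f.fileno(), blkno * BLCKSZ, os.SEEK_SET)
        data.append(os.read(f.fileno(), BLCKSZ))
```

The point is that predictability, not randomness, is what makes the prefetch possible, and the cost model sees neither.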

I suspect that we'll continue to need to somehow distinguish between
random/non-random IO, the differences are simply too large at the storage
level to ignore.

But we also need to add accounting for whether IO is synchronous or not. If
you have a bunch of random IOs but can cheaply know them ahead of time (say,
in a bitmap heap scan), the query should have a different cost than if there
are a bunch of random IOs that we cannot realistically predict (say, the IO
for inner index pages in an ordered index scan, or all the accesses in an
index nested loop where the inner side is unique).

Unfortunately that will probably make it more problematic that we aren't
modeling resource consumption - costing a query that does 10x as many, but
prefetchable, IOs as cheaper than an ever-so-slightly more expensive query is
probably not a tradeoff most want to make.
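To make the distinction concrete, here is a purely hypothetical split of random_page_cost into those two components (the constants and the function are invented for illustration, not proposed values or PostgreSQL APIs):

```python
# Hypothetical decomposition of random_page_cost: a transfer component
# paid by any random IO, plus a wait component paid only when the IO is
# synchronous/unpredictable. All constants are invented.

SEQ_PAGE_COST = 1.0          # baseline, as today
RANDOM_TRANSFER_COST = 1.0   # extra device-level cost of a random IO
SYNC_WAIT_COST = 2.0         # extra cost of an unpredictable, synchronous IO

def page_io_cost(pages, random_io, prefetchable):
    cost = pages * SEQ_PAGE_COST
    if random_io:
        cost += pages * RANDOM_TRANSFER_COST
        if not prefetchable:
            cost += pages * SYNC_WAIT_COST
    return cost

# Same number of random pages, very different costs:
bitmap = page_io_cost(1000, random_io=True, prefetchable=True)    # bitmap heap scan
nestloop = page_io_cost(1000, random_io=True, prefetchable=False) # index nested loop
print(bitmap, nestloop)
```

Under such a split, the bitmap heap scan's predictable random IO is costed well below the nested loop's synchronous random IO, which is the behavior the measurements above suggest we need.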

Greetings,

Andres Freund
