Re: Should we update the random_page_cost default value?

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-08 10:12:10
Message-ID: e47dadd6-d283-42ff-b537-72e3aab7d349@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/8/25 02:04, Tomas Vondra wrote:
>
>
> On 10/7/25 23:08, Peter Geoghegan wrote:
>> On Tue, Oct 7, 2025 at 3:46 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> I think this discrepancy is largely due to the fact that Tomas' is testing
>>> with a cold cache (he has numbers for both), whereas most production workloads
>>> have very high cache hit ratios.
>>
>> Any test case that fails to ensure that all relevant indexes at least
>> have all of their internal B-Tree pages in shared_buffers is extremely
>> unrealistic. That only requires that we cache only a fraction of 1% of
>> all index pages, which is something that production workloads manage
>> to do approximately all the time.
>>
>> I wonder how much the "cold" numbers would change if Tomas made just
>> that one tweak (prewarming only the internal index pages). I don't
>> think that there's a convenient way of running that experiment right
>> now -- but it would be relatively easy to invent one.
>>
>> I'm not claiming that this extra step would make the "cold" numbers
>> generally representative. Just that it might be enough on its own to
>> get wildly better results, which would put the existing "cold" numbers
>> in context.
>>
>
> Why would you expect that?
>
> The index size is about 5% of the table size, so why would the internal
> index pages make any meaningful difference beyond that?
>
> Also, it's true the test starts from "cold" cache, but is still uses
> shared buffers - and I'd expect the internal pages to be very hot,
> compared to the heap. Heap pages are read ~21x, but very far apart. So
> the internal index pages are likely cached even in the cold runs.
>
> I can do some runs after prewarming the (whole) index, just to see if it
> makes any difference.
>

I tried measuring this. I pre-warmed the internal index pages using
pageinspect and pg_prewarm like this:

1) use bt_multi_page_stats to select non-leaf pages
2) use pg_prewarm to load these pages

The query looks like this:

SELECT 'SELECT pg_prewarm(''idx'', ''buffer'', ''main'', ' || blkno ||
', ' || blkno || ');' AS sql
FROM (
-- list of non-leaf index pages
SELECT blkno FROM bt_multi_page_stats('idx', 1, (select relpages - 1
from pg_class where relname = 'idx')) where type != 'l'
)

The generated SQL is written to a SQL script, and then executed after
restarting the instance / dropping caches.

As I expected, it made absolutely no difference. These are the results
for the NVMe RAID0:

seqscan (s) index scan (s) random_page_cost
-----------------------------------------------------------------
prewarming: no 24 25462 49.3
prewarming: yes 24 25690 49.7

No surprise here. It's a little bit slower, but that's well within a
run-to-run variability.

There's only ~5000 non-leaf index pages, It'd be very surprising if it
made any difference on a table with 23809524 pages (and when the
indexscan does 500M page accesses).

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2025-10-08 10:19:57 Re: Logical Replication of sequences
Previous Message David Rowley 2025-10-08 10:09:30 VACUUM (PARALLEL) option processing not using DefElem the way it was intended