From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Robert Treat <rob(at)xzilla(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-08 10:12:10
Message-ID: e47dadd6-d283-42ff-b537-72e3aab7d349@vondra.me
Lists: pgsql-hackers
On 10/8/25 02:04, Tomas Vondra wrote:
>
>
> On 10/7/25 23:08, Peter Geoghegan wrote:
>> On Tue, Oct 7, 2025 at 3:46 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> I think this discrepancy is largely due to the fact that Tomas' is testing
>>> with a cold cache (he has numbers for both), whereas most production workloads
>>> have very high cache hit ratios.
>>
>> Any test case that fails to ensure that all relevant indexes at least
>> have all of their internal B-Tree pages in shared_buffers is extremely
>> unrealistic. That only requires that we cache only a fraction of 1% of
>> all index pages, which is something that production workloads manage
>> to do approximately all the time.
>>
>> I wonder how much the "cold" numbers would change if Tomas made just
>> that one tweak (prewarming only the internal index pages). I don't
>> think that there's a convenient way of running that experiment right
>> now -- but it would be relatively easy to invent one.
>>
>> I'm not claiming that this extra step would make the "cold" numbers
>> generally representative. Just that it might be enough on its own to
>> get wildly better results, which would put the existing "cold" numbers
>> in context.
>>
>
> Why would you expect that?
>
> The index size is about 5% of the table size, so why would the internal
> index pages make any meaningful difference beyond that?
>
> Also, it's true the test starts from a "cold" cache, but it still uses
> shared buffers - and I'd expect the internal pages to be very hot,
> compared to the heap. Heap pages are read ~21x, but very far apart. So
> the internal index pages are likely cached even in the cold runs.
>
> I can do some runs after prewarming the (whole) index, just to see if it
> makes any difference.
>
I tried measuring this. I pre-warmed the internal index pages using
pageinspect and pg_prewarm like this:
1) use bt_multi_page_stats to select non-leaf pages
2) use pg_prewarm to load these pages
The query looks like this:
SELECT 'SELECT pg_prewarm(''idx'', ''buffer'', ''main'', '
       || blkno || ', ' || blkno || ');' AS sql
FROM (
  -- list of non-leaf index pages
  SELECT blkno
    FROM bt_multi_page_stats('idx', 1,
           (SELECT relpages - 1 FROM pg_class WHERE relname = 'idx'))
   WHERE type != 'l'
);
The generated SQL is written to a SQL script, and then executed after
restarting the instance / dropping caches.
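For reference, the whole procedure looks roughly like the script below (the database name, script names, and data directory path are hypothetical, and the cache-drop step is Linux-specific and requires root):

```shell
#!/bin/sh
# Hypothetical sketch of the test procedure; adjust names/paths to the setup.
DB=testdb
PGDATA=/path/to/data

# 1) generate one pg_prewarm() call per non-leaf page of 'idx'
psql -AXt -d "$DB" -f gen-prewarm.sql -o prewarm.sql

# 2) restart the instance and drop the OS page cache
pg_ctl -D "$PGDATA" restart
sync
echo 3 > /proc/sys/vm/drop_caches

# 3) load the internal index pages into shared buffers
psql -AXt -d "$DB" -f prewarm.sql
```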
As I expected, it made absolutely no difference. These are the results
for the NVMe RAID0:
                     seqscan (s)   index scan (s)   random_page_cost
   -----------------------------------------------------------------
   prewarming: no         24            25462              49.3
   prewarming: yes        24            25690              49.7
No surprise here. It's a little bit slower, but that's well within
run-to-run variability.
There are only ~5000 non-leaf index pages. It'd be very surprising if
they made any difference on a table with 23809524 pages (and when the
index scan does 500M page accesses).
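To put a number on that, a quick back-of-envelope calculation with the figures above:

```shell
# ~5000 non-leaf pages is a vanishing fraction of both the heap and the
# total page accesses the index scan performs.
awk 'BEGIN {
  nonleaf  = 5000
  heap     = 23809524    # table pages
  accesses = 500000000   # ~500M page accesses by the index scan
  printf "non-leaf share of heap pages:    %.4f%%\n", 100 * nonleaf / heap
  printf "non-leaf share of page accesses: %.4f%%\n", 100 * nonleaf / accesses
}'
# prints:
# non-leaf share of heap pages:    0.0210%
# non-leaf share of page accesses: 0.0010%
```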
regards
--
Tomas Vondra