From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should we update the random_page_cost default value?
Date: 2025-10-06 23:56:06
Message-ID: kiej6x5p4y3wqyk6jn2y2qv7mxygbyb3ux6gqbha5moizj7tro@zyctdksxbw57
Lists: pgsql-hackers
Hi,
On 2025-10-07 00:52:26 +0200, Tomas Vondra wrote:
> On 10/6/25 23:27, Andres Freund wrote:
> > On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
> >> On 10/6/25 17:14, Andres Freund wrote:
> >>> It also seems that due to the ordering inside the table (the order by
> >>> random()) during the table creation, you're going to see vastly different
> >>> number of page accesses. While that's obviously something worth taking into
> >>> account for planning purposes, I don't think it'd be properly done by the
> >>> random_page_cost itself.
> >>>
> >>
> >> I don't understand your argument. Yes, the index scan does way more page
> >> accesses. Essentially, each index tuple reads a new page - and without a
> >> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
> >> page, which means we have to read it ~21x. In other words, it reads as
> >> many pages as there are tuples.
> >>
> >> This is why my formulas divide the seqscan timing by number of pages,
> >> but indexscan is divided by number of tuples. And AFAIK the costing does
> >> exactly that too. So the random_page_cost doesn't need to do anything
> >> special, we estimate the number of page reads. (And at least in the case
> >> of random table it's pretty accurate.)
> >
> > The thing is that your queries have vastly different timings *regardless* of
> > IO. Here's an example, fully cached:
> >
> > postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
> > ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> > │ QUERY PLAN │
> > ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> > │ Limit (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1) │
> > │ Buffers: shared hit=238096 │
> > │ -> Seq Scan on test_table (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1) │
> > │ Buffers: shared hit=238096 │
> > │ Planning Time: 0.087 ms │
> > │ Execution Time: 392.082 ms │
> > └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> > (6 rows)
> >
> > postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
> > ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> > │ QUERY PLAN │
> > ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> > │ Limit (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1) │
> > │ Buffers: shared hit=5013643 │
> > │ -> Index Scan using test_table_id_idx on test_table (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1) │
> > │ Index Searches: 1 │
> > │ Buffers: shared hit=5013643 │
> > │ Planning Time: 0.131 ms │
> > │ Execution Time: 3025.930 ms │
> > └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> > (7 rows)
> >
> > You have a ~8x difference in time, even without a single IO. A lot of that is
> > probably just the 20x difference in buffer accesses, but also the index path
> > just being a lot less predictable for the CPU.
> >
>
> Right. And that's why the timing is divided by *pages* for sequential
> scans, and by *rows* for index scan. Which for the timings you show ends
> up doing this:
>
> seqscan = 392 / 238096 = 0.0016
> indexscan = 3025 / 5M = 0.0006
>
> So the per-page overhead with index scans is 1/3 of seqscans. AFAICS
> this directly contradicts your argument that random_page_cost ends up
> too high due to indexscans paying more per page.
I don't think that's contradicting it. The seqscan is doing ~21 tuples per
page, the indexscan ~1. There is more work happening for each sequentially
scanned page. And of course the number of pages pinned isn't the only factor.
Just look at a profile of the index query in the cached case:
- 99.98% 0.00% postgres postgres [.] standard_ExecutorRun
standard_ExecutorRun
- ExecProcNodeInstr
- 99.85% ExecLimit
- 99.26% ExecProcNodeInstr
- 98.06% ExecScan
- 92.13% IndexNext
- 91.07% index_getnext_slot
- 85.55% heapam_index_fetch_tuple
- 47.69% ReleaseAndReadBuffer
+ 42.17% StartReadBuffer
1.78% hash_bytes
1.63% UnpinBufferNoOwner
15.94% heap_hot_search_buffer
13.22% heap_page_prune_opt
+ 3.70% ExecStoreBufferHeapTuple
2.15% LWLockAcquire
1.33% LWLockRelease
- 3.96% index_getnext_tid
+ 3.44% btgettuple
+ 4.50% ExecInterpExpr
and compare that with the seqscan:
- 99.93% 0.00% postgres postgres [.] standard_ExecutorRun
standard_ExecutorRun
- ExecProcNodeInstr
- 98.64% ExecLimit
- 93.10% ExecProcNodeInstr
- 81.23% ExecSeqScan
- 70.30% heap_getnextslot
- 56.28% heapgettup_pagemode
- 30.99% read_stream_next_buffer
+ 29.16% StartReadBuffer
- 10.99% heap_prepare_pagescan
7.13% heap_page_prune_opt
1.08% LWLockAcquire
1.42% LWLockRelease
1.13% ReleaseBuffer
- 8.28% ExecStoreBufferHeapTuple
2.00% UnpinBufferNoOwner
2.26% MemoryContextReset
1.31% heapgettup_pagemode
1.27% ExecStoreBufferHeapTuple
3.16% InstrStopNode
2.82% InstrStartNode
1.37% heap_getnextslot
1.00% InstrStartNode
1.30% ExecProcNodeInstr
> I'd bet these costs to be pretty negligible compared to the cost of
> actual I/O. So I don't see how would this explain the timing difference
> with cold data.
Of course there's an increased cost of random IO, I never doubted that! I'm
just saying that your method of measurement seems to over-estimate it.
> > I don't think random/seq_page_cost should account for difference in processing
> > costs that aren't actually related to it being a sequential or a random IO.
> >
>
> Perhaps. If it's not directly related to randomness of I/O, then maybe
> random_page_cost is not the right cost parameter.
>
> But it needs to be included *somewhere*, right? And it's per-page cost,
> and per the experiments there's very clear difference between sequential
> and random access. So why wouldn't random_page_cost be a good fit?
We are already accounting for the increased CPU cost of index scans etc. to
some degree, via the cpu_* costs. If we under-estimate the CPU cost of index
scans we should fix that, regardless of random_page_cost, as a well-correlated
index scan *still* is a lot more expensive than a sequential scan.
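As a rough sketch of what those cpu_* knobs already charge (the numbers are the stock GUC defaults; the formulas are simplified approximations for discussion, not the actual costsize.c code):

```python
# Simplified sketch of the planner's per-tuple CPU charges, using the
# stock GUC defaults. Approximations for discussion, not costsize.c.
CPU_TUPLE_COST = 0.01         # cpu_tuple_cost default
CPU_INDEX_TUPLE_COST = 0.005  # cpu_index_tuple_cost default
CPU_OPERATOR_COST = 0.0025    # cpu_operator_cost default

def seqscan_cpu_cost(ntuples, nquals=0):
    # Each heap tuple pays cpu_tuple_cost plus qual evaluation.
    return ntuples * (CPU_TUPLE_COST + nquals * CPU_OPERATOR_COST)

def indexscan_cpu_cost(ntuples, nquals=0):
    # Each fetched tuple pays the index-tuple charge *and* the heap-tuple
    # charge, so the CPU side is already costed higher than a seqscan.
    return ntuples * (CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST
                      + nquals * CPU_OPERATOR_COST)

# For the 5M-row table from the EXPLAIN output above:
print(seqscan_cpu_cost(5_000_000))
print(indexscan_cpu_cost(5_000_000))
```

So the model already charges the index path 1.5x the seqscan path on the CPU side; the question is whether that ratio is large enough.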
> >>> I think doing this kind of measurement via normal SQL query processing is
> >>> almost always going to have too much other influences. I'd measure using fio
> >>> or such instead. It'd be interesting to see fio numbers for your disks...
> >>>
> >>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
> >>> vs --rw randread
> >>>
> >>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
> >>>
> >>
> >> I can give it a try. But do we really want to strip "our" overhead with
> >> reading data?
> >
> > I think the relevant differences in efficiency here don't come from something
> > inherent in postgres, it's that you're comparing completely different code
> > paths that have different IO independent costs. You could totally measure it
> > in postgres, just not easily via "normal" SQL queries on tables/indexes.
> >
>
> If the two paths for seq/index access have so vastly different
> overheads, why shouldn't that be reflected in the seq_page_cost and
> random_page_cost?
No, because a) random_page_cost is applied to things other than index access,
and b) some index scans are not dominated by random accesses.
> > You can do a decent approximation of this via postgres too, by utilizing
> > pg_prewarm and/or pageinspect. E.g. something like
> >
> > random IO:
> > SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
> > and for sequential IO
> > SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
> >
> > Time for sequential IO: 8066.720 ms
> > Time for random IO: 31623.122 ms
> >
> > That's, almost eerily, close to 4x :)
> >
> >
> > FWIW, when cached, the difference in runtime between the two variants is
> > negligible.
> >
>
> The problem is the sequential query forces the I/O blocks to be 8kB
> each, instead of issuing larger requests (like a regular seqscan). On my
> laptops that makes the seqscan 10x slower, so for regular seqscan the
> ratio is way higher than 4x.
>
> Maybe this the kind of overhead you claim should not be considered when
> setting random_page_cost, but I'm not convinced of that.
A correlated index scan today will not do any IO combining, despite being
costed at seq_page_cost. So just doing individual 8kB IOs actually seems to
be the appropriate comparison. Even with table fetches in index scans doing
IO combining as part of your work, the reads of the index data itself won't
be combined. And I'm sure other things won't be either.
I'd be on-board with trying to improve the cost accounting to take IO
combining into account in some form. I don't quite know how, off-the-cuff: The
advantage of combining seems to quickly drop off after a few pages, but by how
much and where exactly seems to very heavily depend on the specific
hardware. Just dividing the number of sequential accesses by io_combine_limit
seems like it'd be over-estimating the effect substantially.
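To make that concrete, here's one off-the-cuff way such an accounting could look. This is purely illustrative; the decay shape and all the constants are assumptions for discussion, not measurements of any device:

```python
# Illustrative sketch: total IO cost of a run of sequentially accessed
# pages, with the benefit of IO combining decaying after the first few
# pages. The linear decay and the constants are assumptions, not data.
def combined_run_cost(run_len, seq_page_cost=1.0, combine_benefit=0.75,
                      falloff_pages=4):
    total = 0.0
    for i in range(run_len):
        # Pages early in a combined read are cheap; the discount fades
        # out, so pages past the falloff pay full price.
        discount = combine_benefit * max(0.0, 1.0 - i / falloff_pages)
        total += seq_page_cost * (1.0 - discount)
    return total

# Naively dividing by io_combine_limit (say 16) would claim a 16x saving;
# this model caps the saving much earlier.
print(combined_run_cost(16))       # 14.125
print(combined_run_cost(16) / 16)  # 0.8828125 effective per-page cost
```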
I do wonder if we ought to split off the CPU cost associated with both
sequential and random_page_cost into a cpu_page_cost or such. Right now we
IIRC largely disregard the cost of accessing some pages repeatedly, just
because we estimate that to not incur IO costs. But of course a plan that
accesses fewer pages while otherwise doing the same amount of work will be
faster; the profiles further up clearly show that we spend a fair amount of
time in buffer handling even when fully cached.
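As a strawman, the split could look like this. cpu_page_cost is a hypothetical GUC, not something PostgreSQL has today, and the numbers are made up:

```python
# Strawman split of page costs into an IO part and a CPU part.
# cpu_page_cost is a hypothetical knob, not an existing PostgreSQL GUC,
# and both defaults here are invented for illustration.
def page_access_cost(npages, cached_fraction,
                     seq_io_cost=1.0, cpu_page_cost=0.1):
    io_pages = npages * (1.0 - cached_fraction)
    # Cached pages skip the IO charge but still pay for buffer handling
    # (pin/unpin, mapping lookup, locking), which the profiles above
    # show is significant even when fully cached.
    return io_pages * seq_io_cost + npages * cpu_page_cost

# Fully cached, today's model charges essentially nothing per page; here
# the index plan's 5M buffer accesses still cost ~21x the seqscan's 238k.
print(page_access_cost(5_013_643, cached_fraction=1.0))
print(page_access_cost(238_096, cached_fraction=1.0))
```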
Greetings,
Andres Freund