| From: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2025-12-29 13:37:24 |
| Message-ID: | 6ba8a743-2b58-4c5a-8a39-28ccc130a069@garret.ru |
| Lists: | pgsql-hackers |
On 29/12/2025 1:53 AM, Tomas Vondra wrote:
> It seems this is due to sending an extra SET (for the new GUC) in the
> pgbench script, which is recognized only on the v5+threshold build.
>
> That's a thinko on my side, I should have realized the extra command
> might affect this. It doesn't really affect the behavior, because 10 is
> the default value for read_stream_threshold. I've fixed the script, will
> check fresh results tomorrow.
>
> Still, I think most of what I said about heuristics when to initialize
> the read stream, and the risk/benefit tradeoff, still applies.
I did a lot of experiments this morning but could not find any
noticeable difference in any configuration when the whole working set fits in
shared buffers.
And frankly speaking, after more thinking, I do not see a good reason that
could explain such a difference.
Initialization of the read stream by itself should not add much overhead - it
does not seem to be an expensive operation.
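Just for illustration, starting a stream amounts to a single setup call; no IO
is issued until the first read_stream_next_buffer() call pulls blocks through
the callback. A minimal sketch (the callback name and the private data passed
here are placeholders, not the actual patch code):
```
/*
 * Sketch only: read_stream_begin_relation() just allocates and initializes
 * the stream state; physical reads start with read_stream_next_buffer().
 */
stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                    NULL,            /* no buffer access strategy */
                                    scan->heap_rel,  /* placeholder field */
                                    MAIN_FORKNUM,
                                    index_next_block_cb, /* hypothetical callback */
                                    scan,            /* callback private data */
                                    0);              /* no per-buffer data */
```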
What actually matters is async IO. Without a read stream, Postgres reads
heap pages synchronously: the backend just calls pread.
With a read stream, AIO is used. By default the "worker" AIO mode is used, which
means that the backend sends a request to one of the IO workers and waits for its
completion; the worker receives the request, performs the IO and notifies the backend.
Such interprocess communication adds significant overhead, and this is
why, if we initialize the read stream from the very beginning, we get
about ~4x worse performance with LIMIT 1.
Please correct me if I am wrong (or if it is Mac specific), but this is not
caused by any overhead of the read_stream itself, but by AIO.
I have not done such an experiment, but it seems to me that if we make the read
stream perform synchronous calls, there will be almost no difference in
performance.
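On PG 18 and later this should be easy to check without code changes, because
the AIO method is a GUC (set at server start), something like:
```
# postgresql.conf - compare AIO methods (requires a restart)
io_method = sync        # the read stream issues synchronous reads in the backend
#io_method = worker     # default: IO is handed off to IO worker processes
#io_method = io_uring   # Linux only, needs a build with liburing
```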
When all data is cached in shared buffers, we do not perform any IO at all,
so it doesn't matter whether or when we initialize the read_stream:
we can do it after processing 10 items (the current default) or immediately
- it should not affect performance.
And this is what I have tested: performance actually does not depend on
`read_stream_threshold` (as long as the data fits in shared buffers).
At least the difference is within a few percent and may well be just random
fluctuations.
There is certainly no 25% degradation.
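Roughly, the comparison has this shape (the table and query here are only an
illustration, not my actual benchmark; `read_stream_threshold` exists only on
the patched build):
```
-- illustration only: vary when the read stream is initialized and compare
-- timings while the whole table and index are cached in shared buffers
SET read_stream_threshold = 0;    -- initialize the read stream immediately
SELECT * FROM t WHERE a >= 1000 ORDER BY a LIMIT 1;

SET read_stream_threshold = 10;   -- default: only after 10 index items
SELECT * FROM t WHERE a >= 1000 ORDER BY a LIMIT 1;
```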
This definitely doesn't mean it is impossible to find a scenario
where this approach - enabling prefetch after processing N items - shows
worse performance than master or v5; we just need to choose the cache hit
rate accordingly. But the same is true IMHO for v5 itself: it is
possible to find a workload where it shows the same degradation
compared with master.
A more precise heuristic should IMHO take into account the actual number of
disk reads performed.
Please notice that I do not want to predict the number of disk reads - i.e.
check whether the candidates for prefetch are present in shared buffers -
which really would add significant overhead. I think it is better to
use the number of reads already performed as the threshold.
Unfortunately it looks like it is not possible to accumulate such
information without changing other Postgres code.
For example, if `ReadBuffer` could somehow inform the caller that it actually
performed a read, then it would be easy to count the number of reads in
`heapam_index_fetch_tuple`:
```
static pg_attribute_always_inline Buffer
ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
                  ForkNumber forkNum,
                  BlockNumber blockNum, ReadBufferMode mode,
                  BufferAccessStrategy strategy,
                  bool *fast_path)      /* out: true if no read was needed */
{
    ...

    if (StartReadBuffer(&operation,
                        &buffer,
                        blockNum,
                        flags))
    {
        /* the buffer was not already valid, so a real read was performed */
        WaitReadBuffers(&operation);
        *fast_path = false;
    }
    else
        *fast_path = true;

    return buffer;
}
```
This can certainly be achieved without changing the ReadBuffer* family, just by
directly calling StartReadBuffer from `heapam_index_fetch_tuple` instead of
`ReadBuffer`. That is not so nice because we have to duplicate some bufmgr code
- not much, just the check for a local relation and filling the
`ReadBuffersOperation` structure - but it would be better to avoid that.
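To make the idea concrete, the caller side could look roughly like this (the
ReadBufferFastPath wrapper, the nreads counter and start_index_prefetch() are
all made up here, just to show where the counting would happen):
```
/*
 * Hypothetical sketch, not actual patch code: heapam_index_fetch_tuple()
 * counts buffers that required a real read and switches the scan to a
 * read stream only once enough physical IO has been observed.
 */
bool        fast_path;

hscan->xs_cbuf = ReadBufferFastPath(hscan->xs_base.rel,
                                    ItemPointerGetBlockNumber(tid),
                                    &fast_path);
if (!fast_path)
    hscan->nreads++;        /* buffer was not found in shared buffers */

if (hscan->nreads >= read_stream_threshold && hscan->rs == NULL)
    hscan->rs = start_index_prefetch(hscan);    /* begin prefetching */
```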