Re: BitmapHeapScan streaming read user and prelim refactoring

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-18 11:34:23
Message-ID: 059b1665-38aa-450f-a7a0-0bdd37104069@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/17/24 20:36, Tomas Vondra wrote:
>
> ...
>
>> Besides a lot of other things, I finally added debugging fprintfs printing the
>> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
>> large amount of output that generates shows that two iterators were out of
>> sync.
>>
>
> Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
> was hoping you came up with some smart way to trace this king of stuff.
> For example I was wondering if ebpf would be a more convenient way.
>

FWIW I just realized why I failed to identify this "late prefetch" issue
during my investigation. I was experimenting with instrumenting this by
adding a LD_PRELOAD library, logging all pread/fadvise calls. But the
FilePrefetch call is skipped in the page is already in shared buffers,
so this case "disappeared" during processing which matched the two calls
by doing an "inner join".

That being said, I think tracing this using LD_PRELOAD or perf may be
more convenient way to see what's happening. For example I ended up
doing this:

perf record -a -e syscalls:sys_enter_fadvise64 \
-e syscalls:sys_exit_fadvise64 \
-e syscalls:sys_enter_pread64 \
-e syscalls:sys_exit_pread64

perf script -ns

Alternatively, perf-trace can be used and prints the filename too (but
time has ms resolution only). Processing this seems comparable to the
fprintf approach.

It still has the issue that some of the fadvise calls may be absent if
the prefetch iterator gets too far behind, but I think that can be
detected / measured by simply counting the fadvise calls, and comparing
them to pread calls. We expect these to be about the same, so

(#pread - #fadvise) / #fadvise

is a measure of how many were "late" and skipped.

It also seems better than fprintf because it traces the actual syscalls,
not just calls to glibc wrappers. For example I saw this

postgres 54769 [001] 33768.771524828:
syscalls:sys_enter_pread64: ..., pos: 0x30d04000

postgres 54769 [001] 33768.771526867:
syscalls:sys_exit_pread64: 0x2000

postgres 54820 [000] 33768.771527473:
syscalls:sys_enter_fadvise64: ..., offset: 0x30d04000, ...

postgres 54820 [000] 33768.771528320:
syscalls:sys_exit_fadvise64: 0x0

which is clearly a case where we issue fadvise after pread of the same
block already completed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2024-03-18 11:35:18 Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
Previous Message Alexander Korotkov 2024-03-18 11:24:47 Re: Add pg_basetype() function to obtain a DOMAIN base type