Re: BitmapHeapScan streaming read user and prelim refactoring

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-17 19:36:29
Message-ID: 1d9902c0-43b9-4ae7-8c19-70bab3802bd1@enterprisedb.com

On 3/17/24 17:38, Andres Freund wrote:
> Hi,
>
> On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
>> On 3/16/24 20:12, Andres Freund wrote:
>>> That would address some of the worst behaviour, but it doesn't really seem to
>>> address the underlying problem of the two iterators being modified
>>> independently. ISTM the proper fix would be to protect the state of the
>>> iterators with a single lock, rather than pushing down the locking into the
>>> bitmap code. OTOH, we'll only need one lock going forward, so being economic
>>> in the effort of fixing this is also important.
>>>
>>
>> Can you share some details about how you identified the problem, counted
>> the prefetches that happen too late, etc? I'd like to try to reproduce
>> this to understand the issue better.
>
> There's two aspects. Originally I couldn't reliably reproduce the regression
> with Melanie's repro on my laptop. I finally was able to do so after I
> a) changed the block device's read_ahead_kb to 0
> b) used effective_io_concurrency=1
>
> That made the difference between the BitmapAdjustPrefetchIterator() locations
> very significant, something like 2.3s vs 12s.
>

Interesting. I hadn't thought about read_ahead_kb, but in hindsight it
makes sense that it affects these cases. OTOH I did not set it to 0 on
either machine (the 6xSATA RAID0 has it at 12288, for example), and yet
that's how we found the regressions.

For eic it makes perfect sense that setting it to 1 is particularly
vulnerable to this issue - it only takes a small "desynchronization" of
the two iterators for the prefetch to "fall behind" and frequently
prefetch blocks we have already read.
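
To make that concrete: the prefetch distance ramps up towards the
effective_io_concurrency value, so with eic=1 the scan never keeps more
than one block "in flight". Roughly this logic (a simplified sketch of
BitmapAdjustPrefetchTarget() in nodeBitmapHeapscan.c, with the fields
pulled out as plain parameters):

/*
 * Ramp the prefetch distance up towards prefetch_maximum, which is
 * derived from effective_io_concurrency. With eic=1 the target is
 * capped at 1, so a single mis-ordered advance of the two shared
 * iterators already means prefetching a block that was just read.
 */
static void
adjust_prefetch_target(int *prefetch_target, int prefetch_maximum)
{
    if (*prefetch_target >= prefetch_maximum)
        return;                     /* don't increase any further */
    else if (*prefetch_target >= prefetch_maximum / 2)
        *prefetch_target = prefetch_maximum;
    else if (*prefetch_target > 0)
        *prefetch_target *= 2;      /* exponential ramp-up */
    else
        (*prefetch_target)++;       /* start with one block ahead */
}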

> Besides a lot of other things, I finally added debugging fprintfs printing the
> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
> large amount of output that generates shows that two iterators were out of
> sync.
>

Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
was hoping you came up with some smart way to trace this kind of stuff.
For example, I was wondering if eBPF would be a more convenient way.
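
FWIW the kind of instrumentation I had in mind was something like the
sketch below (a hypothetical helper, not from any actual patch) -
dropped next to the posix_fadvise() and the actual read, so the
interleaving of the two iterators can be reconstructed from the log
afterwards:

#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical debugging aid: log (pid, action, block) for every
 * prefetch and every read, one line per event.
 */
static inline void
trace_block(const char *action, long blockno)
{
    fprintf(stderr, "%d %s %ld\n", (int) getpid(), action, blockno);
}

/* e.g. trace_block("prefetch", blockno) right before posix_fadvise(),
 * and trace_block("read", blockno) right before reading the block */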

>
>> If I understand correctly, what may happen is that a worker reads blocks
>> from the "prefetch" iterator, but before it manages to issue the
>> posix_fadvise, some other worker already did pread. Or can the iterators
>> get "out of sync" in a more fundamental way?
>
> I agree that the current scheme of two shared iterators being used has some
> fairly fundamental raciness. But I suspect there's more than that going on
> right now.
>
> Moving BitmapAdjustPrefetchIterator() to later drastically increases the
> raciness because it means table_scan_bitmap_next_block() happens between
> increasing the "real" and the "prefetch" iterators.
>
> An example scenario that, I think, leads to the iterators being out of sync,
> without there being races between iterator advancement and completing
> prefetching:
>
> start:
> real -> block 0
> prefetch -> block 0
> prefetch_pages = 0
> prefetch_target = 1
>
> W1: tbm_shared_iterate(real) -> block 0
> W2: tbm_shared_iterate(real) -> block 1
> W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
> W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
> W1: read block 0
> W2: read block 1
> W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 2, prefetch block 2
> W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target
>
> W1: tbm_shared_iterate(real) -> block 2
> W2: tbm_shared_iterate(real) -> block 3
>
> W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
> W2: read block 3
> W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, prefetch block 3
>
> So afaict here we end up prefetching a block that the *same process* just had
> read.
>

Uh, that's very weird. I'd understand it if there was some cross-process
issue, but if this happens within a single process ... strange.
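
If I'm reading the scenario right, the underlying problem is that the
shared state has three moving parts - the two iterators plus the
prefetch_pages counter - which are advanced in separate steps. A
stripped-down model of that state (hypothetical names, just to spell
out the invariant that gets violated):

/* requires postgres.h and access/tidbitmap.h */

/*
 * Stripped-down model of the shared scan state (hypothetical names).
 * The intended invariant is:
 *
 *   position(prefetch_iterator) == position(real_iterator) + prefetch_pages
 *
 * but because each field is updated in a separate step, two workers
 * can interleave as in the trace above, leaving the prefetch iterator
 * *behind* the real one - at which point BitmapPrefetch() happily
 * issues posix_fadvise() for a block that was already read.
 */
typedef struct SharedBitmapScanState
{
    TBMSharedIterator *real_iterator;       /* blocks handed out for reading */
    TBMSharedIterator *prefetch_iterator;   /* blocks handed out for fadvise */
    int         prefetch_pages;     /* how far prefetch is "ahead" */
    int         prefetch_target;    /* desired prefetch distance */
} SharedBitmapScanState;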

> ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
> separately from advancing the "real" iterator, is pretty ugly for non-parallel
> BHS and just straight up broken in the parallel case.
>

Yeah, I agree with the feeling that it's an ugly fix - it definitely
seems more like fixing the symptoms than the actual problem.
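
If we go the way you suggested - a single lock protecting the whole
iterator state - I'd imagine something roughly like the sketch below
(a hypothetical helper building on the model struct above, not actual
patch code; the caller has to pass a to_prefetch[] array of at least
prefetch_target entries). The point is that both iterators and
prefetch_pages only ever change together, under one lock, while the
actual I/O still happens outside it:

/* requires postgres.h, access/tidbitmap.h and storage/lwlock.h */

/*
 * Advance the scan by one block, keeping the prefetch iterator exactly
 * prefetch_target blocks ahead. Returns the block to read (or
 * InvalidBlockNumber when the bitmap is exhausted) and fills
 * to_prefetch[] with blocks the caller should posix_fadvise().
 */
static BlockNumber
bhs_next_block(SharedBitmapScanState *state, LWLock *lock,
               BlockNumber *to_prefetch, int *nprefetch)
{
    TBMIterateResult *tbmres;
    BlockNumber blockno = InvalidBlockNumber;

    *nprefetch = 0;

    LWLockAcquire(lock, LW_EXCLUSIVE);

    /* the block this worker will actually read */
    tbmres = tbm_shared_iterate(state->real_iterator);
    if (tbmres)
    {
        blockno = tbmres->blockno;

        /*
         * Keep the prefetch iterator from falling behind: either it is
         * ahead (we just consumed one of the prefetched blocks), or we
         * advance it past the block we are about to read.
         */
        if (state->prefetch_pages > 0)
            state->prefetch_pages--;
        else
            (void) tbm_shared_iterate(state->prefetch_iterator);
    }

    /* top up the prefetch distance under the same lock */
    while (tbmres && state->prefetch_pages < state->prefetch_target)
    {
        TBMIterateResult *pres = tbm_shared_iterate(state->prefetch_iterator);

        if (pres == NULL)
            break;
        state->prefetch_pages++;
        to_prefetch[(*nprefetch)++] = pres->blockno;
    }

    LWLockRelease(lock);

    /* caller: posix_fadvise() the to_prefetch[] blocks, then read blockno */
    return blockno;
}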

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
