Re: BitmapHeapScan streaming read user and prelim refactoring

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-01 01:18:20
Message-ID: CAAKRu_YXTOezK3h_YrNJrAUDuAzet59hD1bmmtH4zVPLC00HtA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 2/29/24 23:44, Tomas Vondra wrote:
> >
> > ...
> >
> >>>
> >>> I do have some partial results, comparing the patches. I only ran one of
> >>> the more affected workloads (cyclic) on the xeon, attached is a PDF
> >>> comparing master and the 0001-0014 patches. The percentages are timing
> >>> vs. the preceding patch (green - faster, red - slower).
> >>
> >> Just confirming: the results are for uncached?
> >>
> >
> > Yes, cyclic data set, uncached case. I picked this because it seemed
> > like one of the most affected cases. Do you want me to test some other
> > cases too?
> >
>
> BTW I decided to look at the data from a slightly different angle and
> compare the behavior with increasing effective_io_concurrency. Attached
> are charts for three "uncached" cases:
>
> * uniform, work_mem=4MB, workers_per_gather=0
> * linear-fuzz, work_mem=4MB, workers_per_gather=0
> * uniform, work_mem=4MB, workers_per_gather=4
>
> Each page has charts for master and patched build (with all patches). I
> think there's a pretty obvious difference in how increasing e_i_c
> affects the two builds:

Wow! These visualizations make it exceptionally clear. I want to go to
the Vondra school of data visualizations for performance results!

> 1) On master there's a clear difference between the eic=0 and eic=1
> cases, but on the patched build there's literally no difference - for
> example the "uniform" distribution is clearly not great for prefetching,
> but eic=0 regresses to the same poor behavior as eic=1.

Yes, eic=0 and eic=1 are identical with the streaming read API; that
is, eic=0 does not disable prefetching. Thomas is going to update the
streaming read API to avoid issuing an fadvise for the last block in a
range before issuing a read -- which would mean no prefetching with
eic=0 and eic=1. Not prefetching with eic=1 actually seems like the
right behavior, though it would be different from what master is
doing, right?

Hopefully this fixes the clear difference between master and the
patched version at eic 0.

> 2) For some reason, prefetching with eic>1 performs much better with
> the patches, except with very low selectivity values (close to 0%).
> Not sure why this is happening - either the overhead is much lower
> (which would matter on these "adversarial" data distributions, but how
> could that be when fadvise is not free), or it ends up not doing any
> prefetching (but then what about (1)?).

For the uniform case with four parallel workers, eic=0 being worse
than master makes sense for the above reason. But I'm not totally sure
why eic=1 would be worse with the patch than with master, since both
are doing a (somewhat useless) prefetch.

With very low selectivity, you are less likely to get readahead
(right?) and similarly less likely to be able to build up > 8kB IOs --
which is one of the main value propositions of the streaming read
code. I imagine that this larger read benefit is part of why the
performance is better at higher selectivities with the patch. This
might be a silly experiment, but we could try decreasing
MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
performance gains go away.

> 3) I'm not sure about the linear-fuzz case; the only explanation I
> have is that we're able to skip almost all of the prefetches (and
> read-ahead likely works pretty well here).

I started looking at the data generated by linear-fuzz to understand
exactly what effect the fuzz was having, but I haven't had time to
really understand the characteristics of this dataset. In the original
results, I thought uncached linear-fuzz and linear had similar results
(a similar performance improvement over master). What do you expect
with linear vs. linear-fuzz?

- Melanie
