Re: BitmapHeapScan streaming read user and prelim refactoring

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Date: 2024-03-29 15:52:58
Message-ID: 3ac67b68-47ff-4edb-bb9e-d86504acd573@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/29/24 14:36, Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 12:17 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> eic unpatched patched
>> 0 4172 9572
>> 1 30846 10376
>> 2 18435 5562
>> 4 18980 3503
>> 8 18980 2680
>> 16 18976 3233
>
> ... but the patched version gets down to a low number for eic=0 too if
> you turn up the blockdev --setra so that it also gets Linux RA
> treatment, making it the clear winner on all eic settings. Patched
> doesn't improve. So, for low IOPS storage at least, when you're on
> the borderline between random and sequential, ie bitmap with a lot of
> 1s in it, it seems there are cases where patched doesn't trigger Linux
> RA but unpatched does, and you can tune your way out of that, and then
> there are cases where the IOPS limit is reached due to small reads,
> but patched does better because of larger I/Os that are likely under
> the same circumstances. Does that make sense?

I think you meant "unpatched version gets down" in the first sentence,
right? Still, it seems clear this changes the interaction with readahead
done by the kernel.

However, you seem to focus only on eic=0/eic=1 cases, but IIRC that was
just an example. There are regression with higher eic values too.

I do have some early results from the benchmarks - it's just from the
NVMe machine, with 1M tables (~300MB), and it's just one incomplete run
(so there might be some noise etc.).

Attached is a PDF with charts for different subsets of the runs:

- optimal (would optimizer pick bitmapscan or not)
- readahead (yes/no)
- workers (serial vs. 4 workers)
- combine limit (8kB / 128kB)

The most interesting cases are first two rows, i.e. optimal plans.
Either with readahead enabled (first row) or disabled (second row).

Two observations:

* The combine limit seems to have negligible impact. There's no visible
difference between combine_limit=8kB and 128kB.

* Parallel queries seem to work about the same as master (especially for
optimal cases, but even for not optimal ones).

The optimal plans with kernel readahead (two charts in the first row)
look fairly good. There are a couple regressed cases, but a bunch of
faster ones too.

The optimal plans without kernel read ahead (two charts in the second
row) perform pretty poorly - there are massive regressions. But I think
the obvious reason is that the streaming read API skips prefetches for
sequential access patterns, relying on kernel to do the readahead. But
if the kernel readahead is disabled for the device, that obviously can't
happen ...

I think the question is how much we can (want to) rely on the readahead
to be done by the kernel. Maybe there should be some flag to force
issuing fadvise even for sequential patterns, perhaps at the tablespace
level? I don't recall seeing a system with disabled readahead, but I'm
sure there are cases where it may not really work - it clearly can't
work with direct I/O, but I've also not been very successful with
prefetching on ZFS.

The non-optimal plans (second half of the charts) shows about the same
behavior, but the regressions are more frequent / significant.

I'm also attaching results for the 5k "optimal" runs, showing the timing
for master and patched build, sorted by (patched/master). The most
significant regressions are with readahead=0, but if you filter that out
you'll see the regressions affect a mix of data sets, not just the
uniformly random data used as example before.

On 3/29/24 12:17, Thomas Munro wrote:
> ...
> On the other hand this might be a pretty unusual data distribution.
> People who store random numbers or hashes or whatever probably don't
> really search for ranges of them (unless they're trying to mine
> bitcoins in SQL). I dunno. Maybe we need more realistic tests, or
> maybe we're just discovering all the things that are bad about the
> pre-existing code.

I certainly admit the data sets are synthetic and perhaps adversarial.
My intent was to cover a wide range of data sets, to trigger even less
common cases. It's certainly up to debate how serious the regressions on
those data sets are in practice, I'm not suggesting "this strange data
set makes it slower than master, so we can't commit this".

But I'd also point that what matters is the access pattern, not the
exact query generating it. I agree people probably don't do random
numbers or hashes with range conditions, but that's irrelevant - what
it's all about the page access pattern. If you have IPv4 addresses and
query that, that's likely going to be pretty random, for example.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
run-randomized.sh application/x-shellscript 9.2 KB
results.pdf application/pdf 252.0 KB
results.txt text/plain 44.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-03-29 15:59:40 Re: Popcount optimization using AVX512
Previous Message Nathan Bossart 2024-03-29 15:42:41 Re: Popcount optimization using AVX512