Re: Streaming read-ready sequential scan code

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Streaming read-ready sequential scan code
Date: 2024-04-05 04:14:53
Message-ID: CA+hUKGKXZALJ=6aArUsXRJzBm=qvc4AWp7=iJNXJQqpbRLnD_w@mail.gmail.com
Lists: pgsql-hackers

Yeah, I plead benchmarking myopia, sorry. The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does. Oops.
With the attached minor rearrangement, it works fine. I also poked
some more at that memory prefetcher. Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                   cold    warm   hot
 master                            2479ms  886ms  200ms
 seqscan                           2498ms  716ms  211ms  <-- regression
 seqscan + fastpath                2493ms  711ms  200ms  <-- fixed, I think?
 seqscan + memprefetch             2499ms  716ms  182ms
 seqscan + fastpath + memprefetch  2505ms  710ms  170ms  <-- \O/

Cold shows no difference. That's just my disk demonstrating Linux
read-ahead (RA) at its default 128kB; random I/O is obviously a more
interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier, faster storage than what I have
on hand:

                                   cold
 master                            1775ms
 seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can
you reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far. Note that it's not an entirely independent
trick: it's something we can only do now that we can see into the
future. It's the next level up of prefetching, worth doing around
60ns before you need the data, I guess. Who knows how thrashed the
cache might be before the caller gets around to accessing that page,
but there doesn't seem to be much of a cost or downside to this bet.
We know there are many more opportunities like that[1], but I don't
want to second-guess the AM here; I'm just betting that the caller is
going to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI. Looking into that, going to find my Mac...

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com

Attachment Content-Type Size
v10-0001-Use-streaming-I-O-in-heapam-sequential-scan.patch text/x-patch 7.0 KB
v10-0002-Improve-read_stream.c-s-fast-path.patch text/x-patch 4.8 KB
v10-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch text/x-patch 4.7 KB
v10-0004-Prefetch-page-header-memory-when-streaming-relat.patch text/x-patch 1.7 KB
