| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Small and unlikely overflow hazard in bms_next_member() |
| Date: | 2026-04-03 11:42:17 |
| Message-ID: | CAApHDvocDY0qvhxRNBaNRjEr46j8TsGFig+=6bHN-NYS_dhSCw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, 3 Apr 2026 at 16:24, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> I also did a load test with a standalone c program with 4 versions:
I think you'd be better off picking a more realistic Bitmapset. The
code is doing:
int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
sizeof(bitmapword));
bms->nwords = words_to_alloc;
memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
/* Set a bit far into the set to force a long scan */
int target_bit = (words_to_alloc - 1) * 64 + 10;
bms->words[words_to_alloc - 1] |= (1ULL << 10);
A set with 20k words! Then you're doing:
for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
So you're not looping correctly over the set, as looping over a set
with 1 member requires 2 calls to the function.
I'd say this unrealistically favours the unrolled loop version, as
most of the effort is in the loop looping over empty words and you can
do that without the extra bit-wise AND that the other versions have.
That version also seems to apply both the 64-bit fix and the INT_MAX
fix. Why both? Because you're only calling the function once per
iteration, it also means your || INT_MAX is only executed once, rather
than n_members + 1 as it would normally be executed. That'll make your
version seem better than it is.
I fixed all that and changed the Bitmapset to something more realistic
with how it's more likely to be seen in PostgreSQL and ran the
benchmark with a proper bms_next_member loop. File attached and
results below:
Zen2 Linux
$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 0.62113 seconds
David's: 0.70257 seconds
Chao's version: 1.70691 seconds
Unrolled loop: 1.56239 seconds
2800000000
$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 1.11263 seconds
David's: 0.87399 seconds
Chao's version: 0.52962 seconds
Unrolled loop: 1.07799 seconds
2800000000
Apple M2:
Benchmarking 100000000 iterations...
Original: 0.64023 seconds
David's: 0.41087 seconds
Chao's version: 1.21113 seconds
Unrolled loop: 0.55924 seconds
2800000000
> I’m curious that, when something performs differently across platforms, which platform should take priority?
I don't think it matters too much for this case. If we were actually
working on a function that was a bottleneck, then it would be worth
looking deeper and seeing if the compilers are producing suboptimal
code and then try to coax them into making better code. As a last
resort, have two versions and some preprocessor to decide which one.
However, this just isn't performance-critical enough to warrant such
an effort. I think we're already going far too far with this. I just
want to fix the bug. If performance increases in some way as a result
of that, then good.
If I sum up the times from each version of the above results, I get:
Original: 2.37399
David's: 1.98743
Chao's: 3.44766
Unrolled: 3.19962
I'm happy to go with the fastest one. I expect the one with the fewest
instructions is likely more important when running it in the planner,
as that's less to decode. I didn't look, but I expect your loop
unrolled version and your || INT_MAX version to have more
instructions.
David
| Attachment | Content-Type | Size |
|---|---|---|
| test_bms_next2.c | text/plain | 5.2 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Daniil Davydov | 2026-04-03 11:43:33 | Re: POC: Parallel processing of indexes in autovacuum |
| Previous Message | Nazir Bilal Yavuz | 2026-04-03 11:37:47 | Re: pg_waldump: support decoding of WAL inside tarfile |