Re: Small and unlikely overflow hazard in bms_next_member()

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Small and unlikely overflow hazard in bms_next_member()
Date: 2026-04-03 13:28:03
Message-ID: D8422C13-E995-4D0D-B296-5D643DE89399@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Apr 3, 2026, at 19:42, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>
> On Fri, 3 Apr 2026 at 16:24, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>> I also did a load test with a standalone c program with 4 versions:
>
> I think you'd be better off picking a more realistic Bitmapset. The
> code is doing:
>
> int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
> Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
> sizeof(bitmapword));
> bms->nwords = words_to_alloc;
> memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
>
> /* Set a bit far into the set to force a long scan */
> int target_bit = (words_to_alloc - 1) * 64 + 10;
> bms->words[words_to_alloc - 1] |= (1ULL << 10);
>
> A set with 20k words! Then you're doing:
>
> for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
>
> So you're not looping correctly over the set, as looping over a set
> with 1 member requires 2 calls to the function.
>
> I'd say this unrealistically favours the unrolled loop version, as
> most of the effort is in the loop looping over empty words and you can
> do that without the extra bit-wise AND that the other versions have.
> That version also seems to apply both the 64-bit fix and the INT_MAX
> fix. Why both? Because you're only calling the function once per
> iteration, it also means your || INT_MAX is only executed once, rather
> than n_members + 1 as it would normally be executed. That'll make your
> version seem better than it is.
>
> I fixed all that and changed the Bitmapset to something more realistic
> with how it's more likely to be seen in PostgreSQL and ran the
> benchmark with a proper bms_next_member loop. File attached and
> results below:
>
> Zen2 Linux
>
> $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original: 0.62113 seconds
> David's: 0.70257 seconds
> Chao's version: 1.70691 seconds
> Unrolled loop: 1.56239 seconds
> 2800000000
>
> $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original: 1.11263 seconds
> David's: 0.87399 seconds
> Chao's version: 0.52962 seconds
> Unrolled loop: 1.07799 seconds
> 2800000000
>
> Apple M2:
>
> Benchmarking 100000000 iterations...
>
> Original: 0.64023 seconds
> David's: 0.41087 seconds
> Chao's version: 1.21113 seconds
> Unrolled loop: 0.55924 seconds
> 2800000000
>
>> I’m curious that, when something performs differently across platforms, which platform should take priority?
>
> I don't think it matters too much for this case. If we were actually
> working on a function that was a bottleneck, then it would be worth
> looking deeper and seeing if the compilers are producing suboptimal
> code and then try to coax them into making better code. As a last
> resort, have two versions and some preprocessor to decide which one.
> However, this just isn't performance-critical enough to warrant such
> an effort. I think we're already going far too far with this. I just
> want to fix the bug. If performance increases in some way as a result
> of that, then good.
>
> If I sum up the times from each version of the above results, I get:
>
> Original: 2.37399
> David's: 1.98743
> Chao's: 3.44766
> Unrolled: 3.19962
>
> I'm happy to go with the fastest one. I expect the one with the fewest
> instructions is likely more important when running it in the planner,
> as that's less to decode. I didn't look, but I expect your loop
> unrolled version and your || INT_MAX version to have more
> instructions.
>
> David
> <test_bms_next2.c>

In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
```
chaol(at)ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original: 0.61559 seconds
David's: 0.61651 seconds
Chao's version: 1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```

(I also checked on Godbolt, from the instruction-count perspective, the unrolled version actually looks the worst.)

I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.

It was fun working on this patch, and thank you for your patience.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2026-04-03 13:36:31 Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part
Previous Message Ashutosh Bapat 2026-04-03 13:12:58 Re: Better shared data structure management and resizable shared data structures