Re: Use compiler intrinsics for bit ops in hash

From: Jesse Zhang <sbjesse(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Use compiler intrinsics for bit ops in hash
Date: 2020-03-08 23:44:21
Message-ID: CAGf+fX6mCvz3xX7zzwtyyrcfXGbYOW=g-mncVN37St_Kc4wCeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi John,
Oops this email has been sitting in my outbox for 3 days...

On Wed, Mar 4, 2020 at 1:46 AM John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> wrote:
> On Tue, Mar 3, 2020 at 4:46 AM Jesse Zhang <sbjesse(at)gmail(dot)com> wrote:
> > I've quickly put together a PoC patch on top of yours, which
> > re-implements ceil_log2 using LZCNT coupled with a CPUID check.
> > Thoughts?
>
> This patch seems to be making an assumption that an indirect function
> call is faster than taking a branch (in inlined code) that the CPU
> will almost always predict correctly. It would be nice to have some
> numbers to compare. (against pg_count_leading_zeros_* using the "slow"
> versions but statically inlined).
>

Ah, how could I forget that... I ran a quick benchmark on my laptop, and
indeed, even though the GCC-generated code takes a hit on zero input
(Clang generates slightly different code that gives indistinguishable
runtime for zero and non-zero inputs), the inlined code (the function
input in my benchmark is never a constant literal so the branch does get
exercised at runtime) is still more than twice as fast as the function
call.

------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
BM_pfunc/0 1.57 ns 1.56 ns 447127265
BM_pfunc/1 1.56 ns 1.56 ns 449618696
BM_pfunc/8 1.57 ns 1.57 ns 443013856
BM_pfunc/64 1.57 ns 1.57 ns 448784369
BM_slow/0 0.602 ns 0.600 ns 1000000000
BM_slow/1 0.391 ns 0.390 ns 1000000000
BM_slow/8 0.392 ns 0.391 ns 1000000000
BM_slow/64 0.391 ns 0.390 ns 1000000000
BM_fast/0 1.47 ns 1.46 ns 477513921
BM_fast/1 1.47 ns 1.46 ns 473992040
BM_fast/8 1.46 ns 1.46 ns 474895755
BM_fast/64 1.47 ns 1.46 ns 477215268

For your amusement, I've attached the meat of the benchmark. To build
the code you can grab the repository at
https://github.com/d/glowing-chainsaw/tree/pfunc

> Stylistically, "8 * sizeof(num)" is a bit overly formal, since the
> hard-coded number we want is in the name of the function.

Oh yeah, overly generic code is indicative of the remnants of my C++
brain, will fix.

Cheers,
Jesse

Attachment Content-Type Size
lzcnt.cc text/x-c-code 378 bytes
lzcnt.h text/x-c-code 485 bytes
main.cc text/x-c-code 621 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-03-09 00:01:57 Re: Additional improvements to extended statistics
Previous Message Tom Lane 2020-03-08 22:48:33 Re: Add absolute value to dict_int