|From:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|To:||Jeremy Kerr <jk(at)ozlabs(dot)org>|
|Cc:||Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org|
|Subject:||Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
Jeremy Kerr <jk(at)ozlabs(dot)org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.
> This requires a new check for __builtin_clz in the configure script.
> Results in a ~2% performance increase on sysbench on PowerPC.
I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop. And there's a problem: on
x86_64 it is not much of a win. The code sequence that gcc generates
for __builtin_clz is basically
bsrl %eax, %eax
xorl $31, %eax
and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction. It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32). It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.
The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
I wonder whether it would work better to do manual unrolling of the
shift loop. That is, forget about the builtin and do something like
while (size >= 8)
idx += 4;
size >>= 4;
size >>= 1;
Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.
I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.
regards, tom lane
|Next Message||Tom Lane||2009-07-20 17:00:12||Re: WIP: Deferrable unique constraints|
|Previous Message||Joshua D. Drake||2009-07-20 16:43:46||Re: hot standby - merged up to CVS HEAD|