From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> |
Cc: | Jeremy Kerr <jk(at)ozlabs(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex |
Date: | 2009-07-20 19:25:25 |
Message-ID: | 25859.1248117925@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction. It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf
> hmm interesting - I don't have the exact numbers any more but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current generation Intel
> Nehalem...
Hmm. I may be overestimating the importance of the smaller size
categories. To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine. On the 32-bit
machine I got these totals (free-list index, then number of calls):
0 5190113
1 5663980
2 3573261
3 4476398
4 4246178
5 1100634
6 386501
7 601654
8 44884
9 52372
10 202801
and on the 64-bit machine these:
0 2139534
1 5994692
2 5711479
3 3289034
4 4550931
5 2573389
6 487566
7 588470
8 155148
9 52750
10 202597
If you want to do the same in some other workload, feel free. I
wouldn't trust the results from a single-purpose benchmark too much,
though.
I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3). (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)
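For readers without the attachments, a driver along these lines might look like the sketch below. It is not the attached harness: the names free_index_loop, size_for_index, run_distribution and counts32 are made up for illustration, and the constants (8-byte minimum chunk, 11 free lists) are assumptions chosen to match the 0..10 indexes in the tables above. The function under test is passed as a pointer and its results are summed, which is one way to keep the compiler from folding the calls away; in a real measurement it would still live in a separate file, as noted above.

```c
#include <assert.h>
#include <stddef.h>

#define ALLOC_MINBITS 3             /* assumed: smallest chunk is 8 bytes */
#define NUM_FREELISTS 11            /* assumed: indexes 0..10 as in the tables */

/* Call counts per free-list index from the 32-bit run quoted above. */
const long counts32[NUM_FREELISTS] = {
    5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
    386501, 601654, 44884, 52372, 202801
};

/* Function under test: the plain shift-and-test loop. */
int
free_index_loop(size_t size)
{
    int idx = 0;

    if (size > 0)
    {
        size = (size - 1) >> ALLOC_MINBITS;
        while (size != 0)
        {
            idx++;
            size >>= 1;
        }
        assert(idx < NUM_FREELISTS);
    }
    return idx;
}

/* Largest request size that still maps to free-list index idx. */
size_t
size_for_index(int idx)
{
    return (size_t) 1 << (idx + ALLOC_MINBITS);
}

/*
 * Call fn once per counted call (divided by scale), using a request size
 * that lands in the matching free list.  The sum of results is returned
 * so that the calls cannot be discarded as dead code.
 */
long
run_distribution(int (*fn) (size_t), const long *counts, long scale)
{
    long sum = 0;

    for (int idx = 0; idx < NUM_FREELISTS; idx++)
    {
        size_t size = size_for_index(idx);
        long   n = counts[idx] / scale;

        for (long i = 0; i < n; i++)
            sum += fn(size);
    }
    return sum;
}
```

Timing run_distribution(free_index_loop, counts32, 1) against the same call with another implementation gives a comparison weighted by the measured distribution.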
What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime. It's better on an
old PPC Mac; saves about 50%. Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.
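To make the comparison concrete, here is a minimal sketch of what a __builtin_clz-based version could look like. It is not the patch under discussion: the name free_index_clz is hypothetical, the constants are assumptions (8-byte minimum chunk, 11 free lists), and __builtin_clz is available only in GCC-compatible compilers. The naive version it replaces shifts the size right one bit at a time and counts iterations.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define ALLOC_MINBITS 3             /* assumed: smallest chunk is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11   /* assumed: indexes 0..10 */

/*
 * Free-list index for a request of the given size, computed from the
 * position of the highest set bit instead of a shift-and-test loop.
 */
int
free_index_clz(size_t size)
{
    int idx = 0;

    /* Sizes up to the minimum chunk size all map to index 0. */
    if (size > ((size_t) 1 << ALLOC_MINBITS))
    {
        /* t is at least 1 here, so __builtin_clz(t) is well defined. */
        unsigned int t = (unsigned int) ((size - 1) >> ALLOC_MINBITS);

        /* Number of significant bits in t. */
        idx = (int) (sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(t);
        assert(idx < ALLOCSET_NUM_FREELISTS);
    }
    return idx;
}
```

On x86 the builtin typically compiles down to BSR, which is why the instruction's latency matters for the numbers above.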
I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature. However, some quick testing
didn't find an unrolling that helps much.
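For illustration only, one possible manual unrolling is a plain comparison chain that tests the smallest categories first, since those dominate the counts above. This is an assumption about what such an unrolling might look like, not one of the variants actually tested; the name free_index_unrolled and the 8-byte to 8192-byte size range are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNK_SIZE 8192         /* assumed: 8 << 10, the largest category */

/*
 * Fully unrolled version: one comparison per size category, cheapest for
 * the small requests that make up most of the calls.
 */
int
free_index_unrolled(size_t size)
{
    assert(size <= MAX_CHUNK_SIZE);

    if (size <= 8)
        return 0;
    if (size <= 16)
        return 1;
    if (size <= 32)
        return 2;
    if (size <= 64)
        return 3;
    if (size <= 128)
        return 4;
    if (size <= 256)
        return 5;
    if (size <= 512)
        return 6;
    if (size <= 1024)
        return 7;
    if (size <= 2048)
        return 8;
    if (size <= 4096)
        return 9;
    return 10;
}
```

This relies only on standard C, at the cost of up to ten comparisons for the largest requests.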
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
unknown_filename | text/plain | 980 bytes |
unknown_filename | text/plain | 936 bytes |
unknown_filename | text/plain | 720 bytes |