Re: Re: CRC

From: Bruce Guenter <bruceg(at)em(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: CRC
Date: 2000-12-10 06:24:17
Message-ID: 20001210002417.Q9706@em.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 09, 2000 at 06:46:23PM -0500, Tom Lane wrote:
> Bruce Guenter <bruceg(at)em(dot)ca> writes:
> > The difference is likely because PA-RISC (like most other RISC
> > architectures) lack a "roll" opcode that is very prevalent in the MD5
> > algorithm.
>
> A good theory, but unfortunately not a correct theory. PA-RISC can do a
> circular shift in one cycle using the "shift right double" instruction,
> with the same register specified as both high and low halves of the
> 64-bit operand. And gcc does know about that.

Interesting. I was under the impression that virtually no RISC CPU had
a rotate instruction. Do any others?

> After some groveling through assembly code, it seems that the CRC-32
> implementation is about as tight as it could get: two loads, two XORs,
> and two EXTRU's per byte (one used to implement the right shift, the
> other to implement masking with 0xFF).

Same with the x86 core:
movb %dl,%al
xorb (%ecx),%al
andl $255,%eax
shrl $8,%edx
incl %ecx
xorl (%esi,%eax,4),%edx

> And the wall clock timing does
> indeed come out to just about six cycles per byte.

On my Celeron, the timing for those six opcodes is almost whopping 13
cycles per byte. Obviously there's some major performance hit to do the
memory instructions, because there's no more than 4 cycles worth of
dependant instructions in that snippet.

BTW, for reference, P3 timings are almost identical to those of the
Celeron, so it's not causing problems outside the built-in caches common
to the two chips.

> The MD5 code also
> looks pretty tight. Each basic OP requires either two or three logical
> operations (and/or/xor/not) depending on which round you're looking at,
> plus four additions and a circular shift. PA-RISC needs two cycles to
> load an arbitrary 32-bit constant, but other than that I see no wasted
> cycles here:
>
> ldil L'-1444681467,%r20
> xor %r3,%r14,%r19
> ldo R'-1444681467(%r20),%r20
> and %r1,%r19,%r19
> addl %r15,%r20,%r20
> xor %r14,%r19,%r19
> addl %r19,%r26,%r19
> addl %r20,%r19,%r15
> shd %r15,%r15,27,%r15
> addl %r15,%r3,%r15

Here's the x86 assembly code for what appears to be the same basic OP:
movl %edx,%eax
xorl %esi,%eax
andl %edi,%eax
xorl %esi,%eax
movl -84(%ebp),%ecx
leal -1444681467(%ecx,%eax),%eax
addl %eax,%ebx
roll $5,%ebx
addl %edx,%ebx
This is a couple fewer instructions, mainly saving on doing any loads to
use the constant value. This takes almost exactly 9 cycles per byte.

> There are 64 of these basic OPs needed in each round, and each round
> processes 64 input bytes, so basically you can figure one OP per byte.
> Ignoring loop overhead and so forth, it's nine or ten cycles per byte
> for MD5 versus six for CRC.

On Celeron/P3, CRC scores almost 13 cycles per byte, MD4 is about 6
cycles per byte, and MD5 is about 9 cycles per byte. On Pentium MMX,
CRC is 7.25, MD4 is 7.5 and MD5 is 10.25. So, the newer CPUs actually
do worse on CRC than the older ones do. Weirder and weirder.

> I'm at a loss to see how a Pentium would arrive at a better result for
> MD5 than for CRC. For one thing, it's going to be at a disadvantage
> because it hasn't got enough registers.

I agree. It would appear that the table lookup is causing a major
bubble in the pipelines on the newer Celeron/P2/P3 CPUs.
--
Bruce Guenter <bruceg(at)em(dot)ca> http://em.ca/~bruceg/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Guenter 2000-12-10 06:35:54 Re: CRC, hash & Co.
Previous Message Bruce Guenter 2000-12-10 05:37:42 Re: Re: CRC