Re: Faster StrNCpy

From: "Strong, David" <david(dot)strong(at)unisys(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster StrNCpy
Date: 2006-10-02 16:06:35
Message-ID: B6419AF36AC8524082E1BC17DA2506E802579E2C@USMV-EXCH2.na.uis.unisys.com
Lists: pgsql-hackers pgsql-patches

Mark,

Thanks for attaching the C code for your test. I ran a few tests on a 3 GHz Intel Xeon Paxville (dual core) system. I hope the formatting of this table survives:


Method   Size    N=1024*1024            N=1

MEMCPY     63     6964927 us      582494 us
MEMCPY     32     7102497 us      582467 us
MEMCPY     16     7116358 us      582538 us
MEMCPY      8     6965239 us      582796 us
MEMCPY      4     6964722 us      583183 us

STRNCPY    63    10131174 us     8843010 us
STRNCPY    32    10648202 us     9563868 us
STRNCPY    16     9187398 us     7969947 us
STRNCPY     8     9275353 us     8042777 us
STRNCPY     4     9067570 us     8058532 us

STRLCPY    63    15045507 us    14379702 us
STRLCPY    32     8960303 us     8120471 us
STRLCPY    16     7393607 us     4915457 us
STRLCPY     8     7222983 us     3211931 us
STRLCPY     4     7181267 us     1725546 us

LENCPY     63     7608932 us     4416602 us
LENCPY     32     7252849 us     3807535 us
LENCPY     16    11680927 us    10331487 us
LENCPY      8    10409685 us     9660616 us
LENCPY      4    10824632 us     9525082 us


The first column is the copy method, the second column is the source string size (size of -DSTRING), and the third and fourth columns are the timings for the two -DN settings.

The memcpy () call is the clear winner at all source string sizes. The strncpy () call beats strlcpy () only for the 63-byte string; from 32 bytes down, strlcpy () is faster. This is probably due to the zero padding effect of strncpy (), which always writes the full n bytes. The lencpy () call starts out strong, but degrades sharply once the string is 16 bytes or shorter. This was a little surprising and I don't have an explanation for this behavior at this time.
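To make the zero padding point concrete, here is a minimal sketch (not taken from Mark's test program) that counts how many destination bytes each approach writes. The helper names and the 0x7f sentinel are my own assumptions for illustration; the sketch also assumes the source string contains no 0x7f bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SENTINEL 0x7f   /* assumed not to occur in the source string */

/* Count the bytes in dst[0..n) that no longer hold the sentinel. */
static size_t count_touched(const char *dst, size_t n)
{
    size_t i, touched = 0;

    for (i = 0; i < n; i++)
        if (dst[i] != SENTINEL)
            touched++;
    return touched;
}

/* strncpy() copies the string and then pads with zeros up to n bytes. */
size_t bytes_touched_by_strncpy(const char *src, size_t n)
{
    char dst[256];

    assert(n <= sizeof(dst));
    memset(dst, SENTINEL, sizeof(dst));
    strncpy(dst, src, n);
    return count_touched(dst, n);
}

/* strlen() + memcpy() writes only the string and one terminator. */
size_t bytes_touched_by_lencpy(const char *src, size_t n)
{
    char dst[256];
    size_t len;

    assert(n > 0 && n <= sizeof(dst));
    memset(dst, SENTINEL, sizeof(dst));
    len = strlen(src);
    if (len >= n)
        len = n - 1;            /* truncate to leave room for the terminator */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return count_touched(dst, n);
}
```

For a 3-character source and a 64-byte limit, the strncpy () variant writes all 64 bytes, while the strlen () + memcpy () variant writes 4.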

The AMD optimization manuals have some interesting examples for optimizations for memcpy, along the lines of cache line copies and prefetching:


http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF#search=%22amd%20optimization%20manual%22


http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf#search=%22amd%20optimization%20manual%22


There also used to be an interesting article on the SGI web site called "Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540", but this seems to have been pulled. I did find a copy of the article here:


http://eunchul.com/database/board/cat.php?data=Win32_API&board_group=D42a8ff5c3a9b9


Obviously, different copy mechanisms suit different data sizes. So, I added a little debug to the strlcpy () function that was added to Postgres the other day. I ran a test against Postgres for ~15 minutes that used 2 client backends and the BG writer - 8330804 calls to strlcpy () were generated by the test.

Out of the 8330804 calls, 6226616 calls used a maximum copy size of 2213 bytes e.g. strlcpy (dest, src, 2213) and 2104074 calls used a maximum copy size of 64 bytes.

I know the 2213 size calls come from the set_ps_display () function. I don't know where the 64 size calls come from, yet.

In the 64 size case, with the exception of 35 calls, the calls are only copying 1 byte - I would assume this is just the NUL terminator, i.e. an empty source string.

In the 2213 size case, 1933027 calls copy 20 bytes; 2189415 calls copy 5 bytes; 85550 calls copy 6 bytes and 2018482 calls copy 7 bytes.
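For anyone who wants to collect similar numbers, the kind of counting described above could look roughly like the sketch below. This is not the debug code I actually used; counted_strlcpy (), calls_copying () and the bucket limit are hypothetical names chosen for illustration, and "bytes copied" is assumed to include the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TRACKED 64  /* copies of this many bytes or more share one bucket */

/* copy_len_hist[k] = number of calls that wrote k bytes (terminator included) */
static unsigned long copy_len_hist[MAX_TRACKED + 1];

/*
 * strlcpy-style copy with a counter: writes at most siz - 1 bytes of src
 * plus a terminator, and returns strlen(src) like strlcpy() does.
 */
size_t counted_strlcpy(char *dst, const char *src, size_t siz)
{
    size_t srclen, n, copied;

    assert(dst != NULL && src != NULL);
    srclen = strlen(src);
    if (siz == 0)
        return srclen;          /* nothing can be written */

    n = (srclen < siz) ? srclen : siz - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';

    copied = n + 1;             /* include the terminator in the count */
    copy_len_hist[copied < MAX_TRACKED ? copied : MAX_TRACKED]++;
    return srclen;
}

/* Number of calls so far that wrote exactly nbytes (or >= MAX_TRACKED). */
unsigned long calls_copying(size_t nbytes)
{
    return copy_len_hist[nbytes < MAX_TRACKED ? nbytes : MAX_TRACKED];
}
```

Dumping the histogram at backend exit would give the per-length call counts; a real version would also need to record the siz argument to separate the 64 and 2213 cases.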

Based on this data, it would seem that either memcpy () or strlcpy () calls would be a better fit than strncpy (), because the source strings are far shorter than the maximum copy size.

Calls originating from the set_ps_display () function might be able to use the memcpy () call, as the size of the source string should be known. The other calls probably need something like strlcpy (), as the length of the source string might not be known, although using memcpy () to copy in XX byte blocks might be interesting.
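As a rough sketch of the known-length idea (this is not the actual set_ps_display () code, and copy_known_len () is a hypothetical helper name), the caller passes the source length it already has, so no strlen () scan or zero padding is needed:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy srclen bytes of src into dst (capacity dstsize), truncating if
 * necessary and always terminating. Returns the number of bytes copied,
 * not counting the terminator.
 */
size_t copy_known_len(char *dst, size_t dstsize,
                      const char *src, size_t srclen)
{
    size_t n;

    assert(dst != NULL && src != NULL && dstsize > 0);
    n = (srclen < dstsize) ? srclen : dstsize - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

A call such as copy_known_len (buf, sizeof (buf), "idle", 4) writes 5 bytes in total, regardless of how large buf is.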

David

________________________________

From: pgsql-hackers-owner(at)postgresql(dot)org on behalf of mark(at)mark(dot)mielke(dot)cc
Sent: Fri 9/29/2006 2:59 PM
To: Tom Lane
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Faster StrNCpy

On Fri, Sep 29, 2006 at 05:34:30PM -0400, Tom Lane wrote:
> mark(at)mark(dot)mielke(dot)cc writes:
> > If anybody is curious, here are my numbers for an AMD X2 3800+:
> You did not show your C code, so no one else can reproduce the test on
> other hardware. However, it looks like your compiler has unrolled the
> memcpy into straight-line 8-byte moves, which makes it pretty hard for
> anything operating byte-wise to compete, and is a bit dubious for the
> general case anyway (since it requires assuming that the size and
> alignment are known at compile time).

I did show the .s code. I call into x_memcpy(a, b), meaning that the
compiler can't assume anything. It may happen to be aligned.

Here are results over 64 Mbytes of memory, to ensure that every call is
a cache miss:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 767243 us
MEMCPY: 6044137 us
STRNCPY: 10741759 us
STRLCPY: 12061630 us
LENCPY: 9459099 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 712193 us
MEMCPY: 6072312 us
STRNCPY: 9982983 us
STRLCPY: 6605052 us
LENCPY: 7128258 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 708164 us
MEMCPY: 6042817 us
STRNCPY: 8885791 us
STRLCPY: 5592477 us
LENCPY: 6135550 us

At least on my machine, memcpy() still comes out on top. Yes, assuming that
it is aligned correctly for the machine. Here is unaligned (all arrays are
stored +1 offset in memory):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 790932 us
MEMCPY: 6591559 us
STRNCPY: 10622291 us
STRLCPY: 12070007 us
LENCPY: 10322541 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 764577 us
MEMCPY: 6631731 us
STRNCPY: 9513540 us
STRLCPY: 6615345 us
LENCPY: 7263392 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 825689 us
MEMCPY: 6607777 us
STRNCPY: 8976487 us
STRLCPY: 5878088 us
LENCPY: 6180358 us

Alignment looks like it does impact the results for memcpy(). memcpy()
changes from around 6.0 seconds to 6.6 seconds. Overall, though, it is
still the winner in all cases except for strlcpy(), which beats it on
very short strings ("").

Here is the cache hit case including your strlen+memcpy as 'LENCPY':

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 696157 us
MEMCPY: 825118 us
STRNCPY: 7983159 us
STRLCPY: 10787462 us
LENCPY: 6048339 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 700201 us
MEMCPY: 593701 us
STRNCPY: 7577380 us
STRLCPY: 3727801 us
LENCPY: 3169783 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 706283 us
MEMCPY: 792719 us
STRNCPY: 7870425 us
STRLCPY: 681334 us
LENCPY: 2062983 us

First call was every call being a cache hit. With this one, every one is
a cache miss, and the 64-byte blocks are spread equally over 64 Mbytes of
memory. I've attached the code for your consideration. x.c is the routines
I used to perform the tests. y.c is the main program. strlcpy.c is copied
from the online reference as is without change. The compilation steps
are described above. STRING is the string to try out. N is the number
of 64-byte blocks to allocate. ALIGN is the number of bytes to offset
the array by when storing / reading / writing. ALIGN should be >= 0.

At N=1, it's all in cache. At N=1024*1024 it is taking up 64 Mbytes of
RAM.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/
