Re: tweaking MemSet() performance - 7.4.5

From: Marc Colosimo <mcolosimo(at)mitre(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: List pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Marc Colosimo <mcolosimo(at)mitre(dot)org>, Manfred Spraul <manfred(at)colorfullife(dot)com>, Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
Subject: Re: tweaking MemSet() performance - 7.4.5
Date: 2004-09-29 13:38:39
Message-ID: DF0A0E72-121C-11D9-830D-000A95A5D8B2@mitre.org
Lists: pgsql-hackers

On Sep 29, 2004, at 7:37 AM, Bruce Momjian wrote:

> Karel Zak wrote:
>> On Sat, 2004-09-25 at 23:23 +0200, Manfred Spraul wrote:
>>> mcolosimo(at)mitre(dot)org wrote:
>>>
>>>>> If the memset bypasses the cache then the following access will
>>>>> cause a cache line miss, which can be so slow that using the
>>>>> faster memset can result in a net performance loss.
>>>>
>>>> Could you suggest some structs to test? If I get your meaning, I
>>>> would make a loop that sets then reads from the structure.
>>>>
>>> Read the sources and the cpu specs. Benchmarking such problems is
>>> virtually impossible.
>>> I don't have OS-X, thus I checked the Linux-kernel sources: It seems
>>> that the power architecture doesn't have the same problem as x86.
>>> There is a special clear cacheline instruction for large memsets and
>>> the rest is done through carefully optimized store
>>> byte/halfword/word/double word sequences.
>>>
>>> Thus I'd check what happens if you memset not perfectly aligned
>>> buffers. That's another point where over-optimized functions
>>> sometimes break down. If there is no slowdown, then I'd replace the
>>> postgres function with the OS provided function.
>>>

All memory obtained via malloc and friends will be aligned on OS X,
unless you remove padding (which I don't think you do).
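A rough sketch of how one might check both points, the alignment of a
malloc'd block and whether memset behaves on a deliberately misaligned
start address. The helper names (is_aligned, malloc_is_aligned,
memset_at_offset_ok) are made up for illustration and are not from
PostgreSQL or from the test programs used here. It checks correctness
only, not speed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if p is a multiple of `alignment` bytes.
 * `alignment` must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t) p & (alignment - 1)) == 0;
}

/* Allocate `size` bytes and report whether the block malloc returned
 * is aligned to `alignment`. Returns 0 if the allocation fails. */
int malloc_is_aligned(size_t size, size_t alignment)
{
    void *p = malloc(size);
    int ok = (p != NULL) && is_aligned(p, alignment);

    free(p);
    return ok;
}

/* memset starting `offset` bytes into a malloc'd block, then verify
 * that exactly the bytes in [offset, size) were cleared and the bytes
 * before `offset` were left alone. */
int memset_at_offset_ok(size_t size, size_t offset)
{
    unsigned char *buf = malloc(size);
    size_t i;
    int ok = 1;

    if (buf == NULL || offset > size)
    {
        free(buf);
        return 0;
    }
    memset(buf, 0xFF, size);
    memset(buf + offset, 0, size - offset);   /* misaligned when offset is odd */
    for (i = 0; i < size; i++)
        if (buf[i] != (i < offset ? 0xFF : 0))
            ok = 0;
    free(buf);
    return ok;
}
```

Calling memset_at_offset_ok(512, 1) or (512, 3) exercises the
not-perfectly-aligned case; timing that same call pattern would be the
next step.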

>>> I'd add some __builtin_constant_p() optimizations, but I guess Tom
>>> won't like gcc hacks ;-)
>>
>> I think it cannot be problem if you write it to some .h file (in port
>> directory?) as macro with "#ifdef GCC". The other thing is real
>> advantage of hacks like this in practical PG usage :-)
>
> The reason MemSet is a win is not that the C code is great but because
> it eliminates a function call.
>

Using MemSet really did speed things up. I think the function overhead
is okay. As for real world usage, the function ExecMakeFunctionResult
dropped from the top of the list when profiling (now under 1%, versus
16% before)! The workload was a big, nasty delete (with cascading) and
an insert in a cursor.
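To make the function-call point concrete, here is a simplified sketch
of a MemSet-style macro: small, word-aligned, zero-valued fills are
done with an inline store loop, and everything else falls back to the
library memset. This is not the actual PostgreSQL definition; the names
SKETCH_MEMSET, SKETCH_LOOP_LIMIT and sketch_memset_check are
hypothetical, and the cutoff of 64 bytes is an arbitrary assumption,
not a tuned value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LOOP_LIMIT 64                  /* hypothetical cutoff */
#define SKETCH_WORD_MASK  (sizeof(long) - 1)

/* Inline word-at-a-time zeroing for small aligned buffers; otherwise
 * call the library memset. */
#define SKETCH_MEMSET(start, val, len) \
    do { \
        long   *s_ = (long *) (start); \
        int     v_ = (val); \
        size_t  n_ = (len); \
        if (((uintptr_t) s_ & SKETCH_WORD_MASK) == 0 && \
            (n_ & SKETCH_WORD_MASK) == 0 && \
            v_ == 0 && n_ <= SKETCH_LOOP_LIMIT) \
        { \
            long *stop_ = (long *) ((char *) s_ + n_); \
            while (s_ < stop_) \
                *s_++ = 0; \
        } \
        else \
            memset(s_, v_, n_); \
    } while (0)

/* Fill a buffer with 0xFF, zero the first `len` bytes with the macro,
 * and verify that only those bytes changed. Returns 1 on success. */
int sketch_memset_check(size_t len)
{
    long storage[128];          /* 'long' storage keeps it word-aligned */
    unsigned char *bytes = (unsigned char *) storage;
    size_t i;

    if (len > sizeof(storage))
        return 0;
    memset(storage, 0xFF, sizeof(storage));
    SKETCH_MEMSET(storage, 0, len);
    for (i = 0; i < sizeof(storage); i++)
        if (bytes[i] != (i < len ? 0 : 0xFF))
            return 0;
    return 1;
}
```

With this shape, a 32-byte fill takes the inline loop, while a 33-byte
or 512-byte fill goes through memset, so the two paths can be compared
separately.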

Here are results for a single-processor Mac G4 running OS X 10.3,
compiled with -O2. This time the Mac memset wins all around. Someone
posted that this wasn't the case.

PG MemSet:
pgmemset_test 32
0.670u 0.020s 0:00.70 98.5% 0+0k 0+0io 0pf+0w
pgmemset_test 64
1.060u 0.000s 0:01.05 100.9% 0+0k 0+0io 0pf+0w
pgmemset_test 128
1.750u 0.010s 0:01.76 100.0% 0+0k 0+0io 0pf+0w
pgmemset_test 512
6.010u 0.030s 0:06.04 100.0% 0+0k 0+0io 0pf+0w

Mac memset:
memset_test 32
0.660u 0.020s 0:00.67 101.4% 0+0k 0+0io 0pf+0w
memset_test 64
0.720u 0.000s 0:00.72 100.0% 0+0k 0+0io 0pf+0w
memset_test 128
0.800u 0.010s 0:00.81 100.0% 0+0k 0+0io 0pf+0w
memset_test 512
1.470u 0.010s 0:01.48 100.0% 0+0k 0+0io 0pf+0w
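
The source of the two test programs is not included in this message.
As a rough idea of what such a harness can look like, here is a sketch
that times repeated fills of a given size with clock(). The function
name time_memset and the 4096-byte buffer are assumptions made for
illustration, and the iteration count is left to the caller.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time `iterations` calls of memset on the first `size` bytes of a
 * static buffer. Returns elapsed CPU seconds, or -1.0 if `size` does
 * not fit in the buffer. */
double time_memset(size_t size, long iterations)
{
    static unsigned char buf[4096];
    /* Calling through a volatile function pointer keeps the compiler
     * from inlining or discarding the fill at -O2. */
    void *(*volatile fill)(void *, int, size_t) = memset;
    clock_t start;
    long i;

    if (size > sizeof(buf))
        return -1.0;

    start = clock();
    for (i = 0; i < iterations; i++)
        fill(buf, 0, size);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Swapping the fill call for a MemSet-style macro gives the other half of
the comparison; the sizes 32, 64, 128 and 512 match the runs above.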

I also checked setting a byte right after the memset, and it does slow
down a tiny bit. But the slowdown is the same for both MemSet and
memset for sizes under 64.
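
A sketch of that set-then-touch loop, under the assumption that
"setting a byte" means a single store into the freshly cleared buffer.
The name fill_and_touch is hypothetical, and the volatile pointer is
only there so the compiler cannot drop the store and the read-back.

```c
#include <stddef.h>
#include <string.h>

/* Clear `size` bytes, then immediately store to and read back one byte
 * in the middle of the buffer. Returns the sum of the bytes read back
 * (one per iteration), or -1 if `size` is zero or too large. */
long fill_and_touch(size_t size, long iterations)
{
    static unsigned char buf[4096];
    volatile unsigned char *p = buf;
    long touched = 0;
    long i;

    if (size == 0 || size > sizeof(buf))
        return -1;
    for (i = 0; i < iterations; i++)
    {
        memset(buf, 0, size);
        p[size / 2] = 1;            /* store right after the fill */
        touched += p[size / 2];     /* read it back */
    }
    return touched;
}
```

If the fill bypassed the cache, the store after it is where the extra
cost would show up, so timing this loop against the plain fill loop is
the comparison of interest.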
