Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Sergey Koposov <koposov(at)ast(dot)cam(dot)ac(dot)uk>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
Date: 2012-06-07 17:56:17
Message-ID: CA+CSw_tFM4_e3Xrak1_Ewivd_Wo54kS7y3RzYRv-=7C-27y-Sw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 6, 2012 at 11:42 PM, Sergey Koposov <koposov(at)ast(dot)cam(dot)ac(dot)uk> wrote:
> On Wed, 6 Jun 2012, Merlin Moncure wrote:
>>
>> I think this is the expected result.  In the single-user case the
>> spinlock never spins and only has to execute the cpu-locking cache
>> instructions once.  Can we see results @24 threads?
>
>
> Here https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk

Thank you for testing; these are really interesting results. It looks
like this workload really isn't bound by BufFreelistLock, despite my
best efforts. I did some perf analysis on my machine.

The first thing I noticed was that the profile was dominated by
copying buffers into PostgreSQL's shared_buffers. Seems obvious after
the fact. To see BufFreelistLock contention there needs to be some
noticeable work done under that lock, not just frequent acquisition.
I bumped shared_buffers to 512MB and the table size to 30M rows to
compensate, making the dataset 4.7GB, with 640MB of index.

Looking at the instruction-level annotation of StrategyGetBuffer, I
saw about 1/3 of the time taken up by the atomic updates to
numBufferAllocs and bgwriterLatch. So I added a lock-free test for
bgwriterLatch and used a local variable for allocation counting,
updating the shared variable only every 16 allocations, so that the
common case executes just one locked instruction. I also added
padding to push frequently and less frequently modified variables
onto different cache lines. This should be good enough for
performance-testing purposes; a proper solution would at the very
least have to flush the buffer allocation count on backend
termination.
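
To make the batching concrete, here is a minimal standalone C11
sketch of the idea -- not the actual patch; the struct layout and the
names (strategy_shared, ALLOC_BATCH, bgwriter_latch_needed and so on)
are illustrative assumptions:

#include <stdatomic.h>
#include <stdbool.h>

#define ALLOC_BATCH 16        /* publish the count every 16 allocations */

struct strategy_shared
{
    /* infrequently written statistics */
    _Atomic unsigned long num_buffer_allocs;
    _Atomic bool bgwriter_latch_needed;   /* read-mostly flag */
    char pad[64];             /* keep the hot fields on their own line */
    /* frequently modified fields (clock hand etc.) would follow here */
};

/* per-backend counter; ordinary memory, no locked ops needed */
static unsigned local_allocs = 0;

static void
count_buffer_alloc(struct strategy_shared *shared)
{
    /*
     * Common case: bump a backend-local counter.  Only every 16th
     * call touches shared memory with a locked instruction.  A real
     * version would also flush the remainder at backend exit, as
     * noted above.
     */
    if (++local_allocs >= ALLOC_BATCH)
    {
        atomic_fetch_add(&shared->num_buffer_allocs, local_allocs);
        local_allocs = 0;
    }

    /*
     * Lock-free test: read the flag without taking any lock and only
     * fall into the locked slow path when the latch actually needs
     * to be set.
     */
    if (atomic_load_explicit(&shared->bgwriter_latch_needed,
                             memory_order_relaxed))
    {
        /* take the strategy lock and set the bgwriter latch here */
    }
}

With a batch of 16 the shared counter can lag by at most 15
allocations per backend, which is why the residue has to be flushed
when a backend exits.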

Attached are perf reports for master and the patched version running
the enlarged workload at -c 4. BufFreelistLock no longer appears
under LWLockAcquire, but all the overhead shifted to
StrategyGetBuffer; essentially, the locked-op overhead has just moved
around. Some other observations:
* While there isn't any noticeable contention, the locking overhead
is still huge; judging by instruction-level timings, it is mostly the
cost of getting the lock's cache line in exclusive mode.
* After patching, 59% of StrategyGetBuffer is buffer header lock
acquisition.
* All of the top 11 items in the profile are dominated by one or two
cache misses each: the collision chains navigated by
hash_search_with_hash_value; the lock word for LWLockAcquire and
LWLockRelease; the buffer header for StrategyGetBuffer,
ReadBuffer_common, PinBuffer, UnpinBuffer, and TerminateBufferIO;
PrivateRefCount for PinBuffer, PinBuffer_locked, and UnpinBuffer; and
the heap tuple for heap_hot_search_buffer.

Based on this information, it looks like modifying anything in the
buffer header can be a pretty heavy operation. Making heavily
accessed buffers read-mostly (e.g. by nailing them) could give a
healthy boost to performance. Atomic ops probably won't alleviate the
cost significantly, because the cache lines need to bounce around
anyway, except perhaps in extreme contention cases where spinning
causes stolen cache lines.
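
To illustrate, here is a contrived C11 comparison -- not PostgreSQL
code: whichever way the pin count is bumped, the writer has to pull
the cache line holding it into exclusive state, so the line bounces
between CPUs either way; the atomic version only saves the spin loop
and the extra store for the unlock.

#include <stdatomic.h>

static _Atomic int refcount_atomic;

static void
pin_atomic(void)
{
    atomic_fetch_add(&refcount_atomic, 1);   /* one locked op */
}

static _Atomic int header_lock;              /* 0 = free, 1 = held */
static int refcount_spinlocked;

static void
pin_spinlocked(void)
{
    while (atomic_exchange(&header_lock, 1)) /* spin until acquired */
        ;
    refcount_spinlocked++;
    atomic_store(&header_lock, 0);           /* release */
}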

I have also attached a patch with the most recent changes; it should
at least fix the regression in the single-client case, and maybe
improve scaling. This patch doesn't seem useful until the other
sources of contention are alleviated, so I won't pursue getting it
into a committable state unless someone can show a workload where the
clock sweep is an actual bottleneck.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Attachment Content-Type Size
lockfree-getbuffer.v2.perfreport.txt.bz2 application/x-bzip2 77.3 KB
master.perfreport.txt.bz2 application/x-bzip2 78.2 KB
lockfree-getbuffer.v2.patch application/octet-stream 8.9 KB
