Re: WIP: dynahash replacement for buffer table

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-17 19:02:51
Message-ID: 20141017190251.GG2075@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-10-16 20:22:24 -0400, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > When using shared_buffers = 96GB there's a performance benefit, but not
> > huge:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
> > master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
> >
> > But with shared_buffers = 16GB:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
> > master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
>
> Very interesting. This doesn't show that chash is the right solution,
> but it definitely shows that doing nothing is the wrong solution.

Absolutely.

The thing worrying me most (but not all that much on an absolute scale)
about chash is that lots of the memory management solutions it builds
are specific to it... And generalizing them afterwards will be hard,
because we'll have to prove that the general solution is as performant
as the special-case code...

> It shows that, even with the recent bump to 128 lock manager
> partitions, and LW_SHARED on top of that, workloads that actually
> update the buffer mapping table still produce a lot of contention
> there.

FWIW, I couldn't see much of a benefit without LW_SHARED even though I
tried a *few* things. The bottleneck simply is elsewhere.

> This hasn't been obvious to me from profiling, but the numbers
> above make it pretty clear.

So I'm not super surprised about that. There very well might be cases
where it was a bad bottleneck before, but I've not seen them.

> It also seems to suggest that trying to get rid of the memory barriers
> isn't a very useful optimization project.

I'm not yet convinced of that. Yes, it's not showing up hugely in the
numbers here, but that's simply because the workload is again completely
dominated by the kernel copying data for the read()s, GetSnapshotData(),
the buffer mapping cache misses and a few other familiar friends.

> We might get a couple of
> percent out of it, but it's pretty small potatoes, so unless it can be
> done more easily than I suspect, it's probably not worth bothering
> with.

I still think that moving the barrier to the reading side would be
simple (implementation-wise) and correct for x86. If you think about it,
otherwise our spinlock implementation for x86 would be broken: as an
unlock is just a compiler barrier, the full barrier on acquire had
better be a full synchronization point. Am I missing something?
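
To make the x86 argument concrete, here's roughly what I mean - a
minimal sketch, not our actual s_lock.h code: acquisition is a locked
xchg and therefore a full barrier, while release is just a compiler
barrier plus a plain store.

    /*
     * Illustrative sketch only, not the real s_lock.h implementation.
     * On x86 the TAS() behind SpinLockAcquire() is a locked xchg, which
     * is a full memory barrier, while SpinLockRelease() is a plain
     * store plus a compiler barrier.  A reader could piggyback on the
     * full barrier that already happens when the writer takes the lock.
     */
    static inline int
    x86_spin_trylock(volatile int *lock)
    {
        int         old = 1;

        /* lock xchg is a full memory barrier on x86 */
        __asm__ __volatile__("lock; xchgl %0, %1"
                             : "+r"(old), "+m"(*lock)
                             :
                             : "memory");
        return old == 0;        /* true if we got the lock */
    }

    static inline void
    x86_spin_unlock(volatile int *lock)
    {
        /* release is only a compiler barrier plus a plain store */
        __asm__ __volatile__("" ::: "memory");
        *lock = 0;
    }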

> An approach I think might have more promise is to have bufmgr.c
> call the CHash stuff directly instead of going through buf_table.c.

I don't see all that much point in buf_table.c currently - on the other
hand, it has made replacing the buffer mapping implementation slightly
easier... Maybe it should just go into a header...

> Right now, for example, BufferAlloc() creates and initializes a
> BufferTag and passes a pointer to that buffer tag to BufTableLookup,
> which copies it into a BufferLookupEnt. But it would be just as easy
> for BufferAlloc() to put the BufferLookupEnt on its own stack, and
> then you wouldn't need to copy the data an extra time. Now a 20-byte
> copy isn't a lot, but it's completely unnecessary and looks easy to
> get rid of.

Worthwhile to try.
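
A rough sketch of what I understand the suggestion to be - assuming
CHashSearch() works as in your patch (caller fills in the key fields,
the rest is completed on a hit) and with the BufferLookupEnt layout
copied from buf_table.c:

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "utils/chash.h"        /* header added by the chash patch */

    /* layout as in buf_table.c; not currently exposed in a header */
    typedef struct BufferLookupEnt
    {
        BufferTag   key;            /* tag of the page being looked up */
        int         id;             /* associated buffer id, filled on hit */
    } BufferLookupEnt;

    static int
    BufferAllocLookup(CHashTable table, RelFileNode rnode,
                      ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferLookupEnt ent;        /* lives on our stack; no extra copy */

        INIT_BUFFERTAG(ent.key, rnode, forkNum, blockNum);

        if (!CHashSearch(table, &ent))
            return -1;              /* not in the buffer mapping table */

        return ent.id;
    }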

> > I had to play with setting max_connections+1 sometimes to get halfway
> > comparable results for master - unaligned data otherwise causes weird
> > results. Without doing that the performance gap between master
> > 96/16G was even bigger. We really need to fix that...
> >
> > This is pretty awesome.
>
> Thanks. I wasn't quite sure how to test this or where the workloads
> that it would benefit would be found, so I appreciate you putting time
> into it. And I'm really glad to hear that it delivers good results.

I wasn't sure either ;). Lucky that it played out so impressively. After
the first results I was nearly ready to send out a 'Meh.' type of
message ;)

Btw, CHashTableGetNode isn't exactly cheap. It shows up noticeably in
profiles. It results in several non-pipelineable instructions in an
already penalized section of the code... Replacing arena_stride with a
constant provided measurable improvements (no imul is required anymore,
instead you get shr;lea or so). I'm not sure how to deal with that. If
it shows up even on my quite new laptop, it'll be even more noticeable
on older x86 platforms.
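
To illustrate what I mean - toy code, not the real CHashTableGetNode():
with a run-time stride the address computation needs an imul, with a
compile-time constant the compiler can strength-reduce the multiply.

    #include <stddef.h>
    #include <stdint.h>

    /* run-time stride: typically compiles to an imul on x86-64 */
    static inline char *
    node_addr_variable(char *arena, size_t arena_stride, uint32_t nodeid)
    {
        return arena + (size_t) nodeid * arena_stride;
    }

    /* constant stride: e.g. nodeid * 48 becomes lea (nodeid*3) plus shl 4 */
    #define ARENA_STRIDE_CONST 48
    static inline char *
    node_addr_constant(char *arena, uint32_t nodeid)
    {
        return arena + (size_t) nodeid * ARENA_STRIDE_CONST;
    }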

> I think it would be useful to plumb the chash statistics into the
> stats collector or at least a debugging dump of some kind for testing.

I'm not sure it's solid enough at this point to be in the stats
collector. But I surely would like to access it somehow. For example,
I'm absolutely not sure that your loadfactor is "good", and it'd be
much easier to tell if those stats were visible. I'm not really sure
how to make them visible though.
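
To make the kind of thing I have in mind concrete: even a throwaway
debug function that just logs the counters would do for testing. All
the chash-specific names below are made up, stand-ins for whatever
accessor the patch ends up providing.

    #include "postgres.h"
    #include "fmgr.h"

    /* hypothetical shape of the counters the chash code collects */
    typedef struct CHashStatsSketch
    {
        uint64      searches;
        uint64      searches_failed;
        uint64      allocations_retried;
    } CHashStatsSketch;

    extern void chash_buftable_stats(CHashStatsSketch *stats);  /* hypothetical */

    PG_FUNCTION_INFO_V1(pg_chash_buftable_stats);

    Datum
    pg_chash_buftable_stats(PG_FUNCTION_ARGS)
    {
        CHashStatsSketch stats;

        chash_buftable_stats(&stats);

        elog(LOG, "chash buffer table: searches=" UINT64_FORMAT
             ", failed=" UINT64_FORMAT ", alloc retries=" UINT64_FORMAT,
             stats.searches, stats.searches_failed, stats.allocations_retried);

        PG_RETURN_VOID();
    }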

> They include a number of useful contention measures, and I'd be
> interested to see what those look like on this workload. (If we're
> really desperate for every last ounce of performance, we could also
> disable those statistics in production builds. That's probably worth
> testing at least once to see if it matters much, but I kind of hope it
> doesn't.)

I can't retest on bigger HW right now, but IIRC they didn't show up in
the profile there. They're not visible on my laptop either.

Profile-wise it'd be helpful if BucketScan were inlined. Right now it's
hard to see the difference in cost between search/insert/delete, and I
think that'd be worth the cost of the duplicated instructions...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
