Re: WIP: dynahash replacement for buffer table

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 20:42:32
Message-ID: 20141014204232.GA7242@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> The key idea here is that lookups are done without any locks, only
> >> memory barriers; and inserts and deletes are done using atomic ops.
> >
> > Hm. I quickly looked and I see that you use two full barriers for every
> > lookup. That's pretty expensive. I think this should be doable using
> > only read/write barriers on the lookup side? Even on architectures where
> > read/write memory barriers have to be something but a compiler barrier,
> > they're still usually a magnitude or so cheaper than full barriers.
>
> The code in CHashSearch shows the problem there: you need to STORE the
> hazard pointer before you begin to do the LOAD operations to scan the
> bucket, and you must finish all of those LOADs before you STORE the
> NULL hazard pointer. A read or write barrier won't provide store/load
> or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

> > With regard for using a hash table for the buffer mapping lock I'm
> > doubtful that any form of separate chaining is the right one. We
> > currently have a quite noticeable problem with the number of cache
> > misses in the buffer mapping hash (and also the heavyweight lock mgr) -
> > if we stay with hashes that's only going to be fundamentally lower than
> > dynahash if we change the type of hashing. I've had good, *preliminary*,
> > results using an open addressing + linear probing approach.
>
> I'm very skeptical of open addressing + linear probing. Such hash
> tables tend to degrade over time, because you have to leave tombstones
> behind to ensure that future scans advance far enough.

You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.

> There's really
> no way to recover without rebuilding the whole hash table, and
> eventually it degrades to linear search. If we're spending too much
> time walking hash chains, I think the solution is to increase the
> number of buckets so that the chains get shorter.

Might be worthwile to try. I know that both my handrolled open
addressing and my hand rolled chaining approach have significantly fewer
cache misses than dynahash for the same amount of work.

Hm. Possibly that's at least partially because of the segment stuff in
dynahash?

Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.

> > My guess is that the additional indirection via the arena explains the
> > difference in cache misses? But I might be completely off here.
>
> The arena makes the calculation of the pointer address less
> predictable, which might make things more difficult for the CPU
> pipeline. But aside from that, I think it's the same basic idea: you
> bounce from some sort of bucket header to the first element of the
> chain and then, hopefully, you're done. Am I missing something?

I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there's only few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.

So it's not really the arena, but that you chose not to store the first
element in a chain inline...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2014-10-14 20:56:13 Re: pg_background (and more parallelism infrastructure patches)
Previous Message Dimitri Fontaine 2014-10-14 20:19:26 New Event Trigger: table_rewrite