From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-10-14 21:53:10
Message-ID: CA+TgmoYZ=0vO5W97v=EH18WrpsgLELOih3yYiSPyHQtBOa=XUw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer. A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows a later load to
be reordered ahead of an earlier store, but not a later store ahead of
an earlier load. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
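
To make the ordering constraint concrete, this is roughly the shape I
have in mind - a sketch only, with made-up names rather than the real
CHashSearch, and with pg_memory_barrier() standing in for whichever
fence we end up using at each point:

typedef struct HashNode
{
    struct HashNode *next;
    uint32      hashcode;
    /* key and value payload follow */
} HashNode;

typedef struct HashBucket
{
    HashNode   *head;
} HashBucket;

static bool
hash_search_sketch(HashBucket *bucket, uint32 hashcode,
                   void *volatile *my_hazard_pointer)
{
    HashNode   *node;

    /* Advertise the bucket we're about to scan. */
    *my_hazard_pointer = bucket;

    /*
     * The STORE of the hazard pointer must become visible before any LOAD
     * from the chain.  That's the store/load case, the one ordering x86
     * does not guarantee, so this one has to be a full fence (mfence, or
     * lock; addl) there.
     */
    pg_memory_barrier();

    for (node = bucket->head; node != NULL; node = node->next)
    {
        if (node->hashcode == hashcode)
            break;
    }

    /*
     * All LOADs from the chain must finish before the hazard pointer is
     * cleared.  That's load/store ordering, which x86 provides anyway, so
     * this is the fence we could plausibly weaken or drop there.
     */
    pg_memory_barrier();

    *my_hazard_pointer = NULL;

    return node != NULL;
}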

I tested PPC, because that's what I have. I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases. The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it. Or that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/
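
For what it's worth, here's what I imagine the pg_read_write_barrier()
you're describing would look like - entirely hypothetical, the macro
doesn't exist today, and the per-architecture notes are just my reading
of the links above:

#if defined(__x86_64__) || defined(__i386__)
/*
 * x86 already orders load/load, load/store and store/store, so only the
 * compiler needs restraining; no instruction is emitted.
 */
#define pg_read_write_barrier()  __asm__ __volatile__("" ::: "memory")
#elif defined(__powerpc__) || defined(__powerpc64__)
/*
 * lwsync orders load/load, load/store and store/store, but not
 * store/load.  So it could stand in for the second (clearing-side)
 * fence, but the fence taken after the hazard pointer is stored would
 * still need a full sync.
 */
#define pg_read_write_barrier()  __asm__ __volatile__("lwsync" ::: "memory")
#else
#define pg_read_write_barrier()  pg_memory_barrier()
#endif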

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline. But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done. Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there's only few, so it's
> likely to be cached. Then to the bucket - which, afaics, already can
> contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point. I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise. If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm: there's no good way
to atomically swap out the bucket header if it sits at a fixed position
and is larger than the largest object the machine can manipulate
atomically.
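
Reusing the HashNode from the sketch upthread, the contrast looks
something like this - again just an illustration, not the real
declarations, and ignoring that the real code goes through the arena:

/* Roughly the current arrangement: the bucket header is one pointer, so
 * delete can swap out the chain head with a single compare-and-swap. */
typedef struct HashBucketPointer
{
    HashNode   *head;
} HashBucketPointer;

/* First element stored inline: an empty bucket is as big as a full node,
 * and the header is now several words wide - too wide for any atomic
 * swap the hardware gives us. */
typedef struct HashBucketInline
{
    HashNode    first;
} HashBucketInline;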

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
