Re: Use simplehash.h instead of dynahash in SMgr

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use simplehash.h instead of dynahash in SMgr
Date: 2021-04-25 02:23:26
Message-ID: CAApHDvpVfW3gi0=UA7MjpRRQwy2k6D139tWCkrM6-ACqBp5E=g@mail.gmail.com
Lists: pgsql-hackers

Thanks for having a look at this.

"On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>
> It is quite an interesting result. Simplehash, being open-addressing
> with linear probing, is friendly to the CPU cache. I'd recommend
> defining SH_FILLFACTOR with a value lower than the default (0.9). I
> believe 0.75 is the most suitable for this kind of hash table.

You might be right there, although with the particular benchmark I'm
using, the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.

FWIW, the hash stats at the end of recovery are:

LOG: redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

Perhaps if I try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains. Because I have 2032
members, which is above the 0.9 fillfactor threshold for 2048 buckets
(2048 * 0.9 = ~1844), the hash table grew up to 4096 buckets.

I did a quick test dropping the fillfactor down to 0.4. The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower:

LOG: redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG: size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854

It was slightly slower. I guess that since SMgrEntry is just 16 bytes
wide, 4 of these will sit on each 64-byte cache line, which means
there is a 75% chance that the next bucket over is on the same cache
line. Since the average chain length is just 0.49, we'll mostly only
need to look at a single cache line to find the entry with the
correct hash key.
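For reference, the 16-byte entry I'm talking about is shaped roughly
like this (a sketch only; see the patch for the real definition):

/*
 * simplehash needs a small status field in each bucket; keeping the
 * rest of the data behind a pointer holds the entry at 16 bytes, so
 * four of them share one 64-byte cache line.
 */
typedef struct SMgrEntry
{
	int			status;		/* simplehash entry status */
	SMgrRelation data;		/* pointer to palloc'd SMgrRelationData */
} SMgrEntry;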

> > + /* rotate hashkey left 1 bit at each step */
> > + hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
> > + hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
>
> Why do you use such a strange rotation expression? I know compilers
> are able to translate `h = (h << 1) | (h >> 31)` to a single rotate
> instruction. Do they recognize the construction in your code as well?

Not sure about all compilers; I only checked the earliest versions of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q
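For anyone wanting to repeat that check, the two spellings being
compared are:

/* the form used in the patch */
static inline uint32 rotl1_ternary(uint32 h)
{
	return (h << 1) | ((h & 0x80000000) ? 1 : 0);
}

/* the shift-based form mentioned above; equivalent for unsigned h */
static inline uint32 rotl1_shift(uint32 h)
{
	return (h << 1) | (h >> 31);
}

Both come out as a single "rol" in the compilers I tried.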

> Your construction looks more like a "multiply-modulo" operation in a
> 32-bit Galois field. It is a widely used operation in cryptography,
> but it is used modulo some primitive polynomial, and 0x100000001 is
> not such a polynomial. 0x1000000c5 is, therefore it should be:
>
> hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
> or
> hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);

That does not really make sense to me. If you're shifting a 32-bit
variable right 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on if the most significant bit is
on or off. I see gcc and clang are unable to optimise that into an
"rol" instruction. If I swap the "0xc5" for "1", then they're able to
optimise the expression.

> But why not just use hash_combine(uint32 a, uint32 b) instead
> (defined in hashfn.h)? Yep, it could be a bit slower, but is it
> critical?

I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine(). Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values. Do you?

If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
0x80000000) ? 1 : 0);", then you'll see where else we do the same
rotate left trick.
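For illustration, the combining pattern under discussion looks
something like this (a sketch only; I've left out the backend field
and the exact patch text differs):

static inline uint32
relfilenodebackend_hash(RelFileNodeBackend *rnode)
{
	uint32		hashkey;

	hashkey = murmurhash32((uint32) rnode->node.spcNode);

	/* rotate hashkey left 1 bit at each step */
	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	hashkey ^= murmurhash32((uint32) rnode->node.relNode);

	return hashkey;
}

whereas the hash_combine() version would do, per component:

	hashkey = hash_combine(hashkey, murmurhash32((uint32) rnode->node.dbNode));

which costs a couple of extra shifts and additions each step.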

> > - * smgrclose() -- Close and delete an SMgrRelation object.
> > + * smgrclose() -- Close and delete an SMgrRelation object but don't
> > + * remove from the SMgrRelationHash table.
>
> I believe `smgrclose_internal()` should be in this comment.

Oops. Yeah, that's a mistake.

> Still, I don't believe it is worth separating smgrclose_internal
> from smgrclose. Is there a measurable performance improvement from
> this change? Even if there is, it will be smaller with
> SH_FILLFACTOR 0.75.

The reason I did that is that smgrcloseall() loops over the entire
hash table and removes each entry one by one. The problem is that if
I do a smgrtable_delete or smgrtable_delete_item in that loop then
I'd need to restart the loop each time, because a simplehash delete
can move entries to earlier positions in the table, which could cause
us to miss entries during the loop. Restarting the loop on each
iteration is not going to be very efficient, so instead, I opted to
make a version of smgrclose() that does not remove from the table so
that I can just wipe out all table entries at the end of the loop. I
called that smgrclose_internal(). Maybe there's a better name, but I
don't really see any realistic way of not having some version that
skips the hash table delete. I was hoping the 5-line comment I added
to smgrcloseall() would explain the reason for the code being written
that way.
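To sketch the shape of that loop (illustrative names;
smgrtable_truncate stands in for the patch's SH_TRUNCATE step):

void
smgrcloseall(void)
{
	smgrtable_iterator i;
	SMgrEntry  *entry;

	/*
	 * Deleting during the scan could shuffle later entries backwards
	 * and make us skip them, so just close each relation and leave
	 * the entries in place.
	 */
	smgrtable_start_iterate(SMgrRelationHash, &i);
	while ((entry = smgrtable_iterate(SMgrRelationHash, &i)) != NULL)
		smgrclose_internal(entry->data);

	/* now remove all entries in one shot */
	smgrtable_truncate(SMgrRelationHash);
}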

An additional small benefit is that smgrclosenode() can get away with
a single hash table lookup rather than having to look up the entry
again with smgrtable_delete(). Using smgrtable_delete_item() deletes
by bucket rather than by key value, which should be a good bit faster
in many cases. I think the SH_ENTRY_CLEANUP macro is quite useful
here as I don't need to worry about NULLing out the smgr_owner in yet
another location where I do a hash delete.
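In other words, a hypothetical smgrclosenode() goes from
lookup-then-delete-by-key to something like:

	SMgrEntry  *entry = smgrtable_lookup(SMgrRelationHash, rnode);

	if (entry != NULL)
	{
		smgrclose_internal(entry->data);
		/* delete the bucket we already found; no second probe */
		smgrtable_delete_item(SMgrRelationHash, entry);
	}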

> As well, I don't support modifying simplehash.h for
> SH_ENTRY_INITIALIZER, SH_ENTRY_CLEANUP and SH_TRUNCATE. The
> initialization could comfortably live in smgropen and the cleanup in
> smgrclose. And then SH_TRUNCATE doesn't mean much.

Can you share what you've got in mind here?

The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
= key; to set the new entry's key. Since I have SH_KEY defined as:

#define SH_KEY data->smgr_rnode

then I need some way to allocate the memory for ->data before the key
is set. Doing that in smgropen() is too late; we've already crashed by
then from referencing uninitialised memory.
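Abridged, the relevant part of the configuration is something like
this (a sketch, not the patch verbatim):

#define SH_PREFIX			smgrtable
#define SH_ELEMENT_TYPE		SMgrEntry
#define SH_KEY_TYPE			RelFileNodeBackend
#define SH_KEY				data->smgr_rnode

/*
 * Allocate entry->data before simplehash stores the key through it;
 * otherwise "entry->SH_KEY = key" writes through an uninitialised
 * pointer.
 */
#define SH_ENTRY_INITIALIZER(e) \
	((e)->data = MemoryContextAlloc(TopMemoryContext, \
									sizeof(SMgrRelationData)))
#include "lib/simplehash.h"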

I did try putting the key in SMgrEntry but found the performance to be
quite a bit worse than keeping the SMgrEntry down to 16 bytes. That
makes sense to me as we only need to compare the key when we find an
entry with the same hash value as the one we're looking for. There's a
pretty high chance of that being the entry we want. If I got my hash
function right then the odds are about 1 in 4 billion of it not being
the one we want. The only additional price we pay when we get two
entries with the same hash value is an additional pointer dereference
and a key comparison.

David
