Re: Use simplehash.h instead of dynahash in SMgr

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use simplehash.h instead of dynahash in SMgr
Date: 2021-04-25 06:48:52
Message-ID: c12fd75eb363f2da50ee29013bd73503@postgrespro.ru
Lists: pgsql-hackers

David Rowley wrote on 2021-04-25 05:23:
> Thanks for having a look at this.
>
> "On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
> wrote:
>>
>> It is quite an interesting result. Simplehash, being open addressing with
>> linear probing, is friendly to the CPU cache. I'd recommend defining
>> SH_FILLFACTOR with a value lower than the default (0.9). I believe 0.75 is
>> the most suitable for this kind of hash table.
>
> You might be right there, although, with the particular benchmark I'm
> using the size of the table does not change as a result of that. I'd
> need to experiment with varying numbers of relations to see if
> dropping the fillfactor helps or hinders performance.
>
> FWIW, the hash stats at the end of recovery are:
>
> LOG: redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
> system: 5.61 s, elapsed: 112.67 s
> LOG: size: 4096, members: 2032, filled: 0.496094, total chain: 997,
> max chain: 5, avg chain: 0.490650, total_collisions: 422,
> max_collisions: 3, avg_collisions: 0.207677
>
> Perhaps if I try using a number of relations somewhere between 2048 *
> 0.75 and 2048 * 0.9 then I might see some gains. Because I have 2032,
> the hash table grew to 4096 buckets.
>
> I did a quick test dropping the fillfactor down to 0.4. The aim there
> was just to see if having 8192 buckets in this test would make it
> faster or slower
>
> LOG: redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
> system: 4.28 s, elapsed: 113.93 s
> LOG: size: 8192, members: 2032, filled: 0.248047, total chain: 303,
> max chain: 2, avg chain: 0.149114, total_collisions: 209,
> max_collisions: 2, avg_collisions: 0.102854
>
> it was slightly slower.

Certainly. That is because in the unmodified case you already end up with a
fill factor of 0.49, since the table has just grown. Below roughly 0.6 there
is no gain from a lower fill factor. But if you test closer to the upper
bound, you will notice a difference. Try testing with 3600 relations, for
example, if going down to 1800 is not possible.
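
For reference, a minimal sketch of what lowering the fill factor could look
like at the table definition site. It assumes simplehash.h accepts a
caller-supplied SH_FILLFACTOR override (the default being 0.9, as noted
above); the other SH_* defines are abbreviated and merely follow the naming
used in this thread:

    /* Sketch only: instantiate the smgr hash table with a lower fill factor. */
    #define SH_PREFIX        smgrtable
    #define SH_ELEMENT_TYPE  SMgrEntry
    #define SH_KEY_TYPE      RelFileNodeBackend
    #define SH_FILLFACTOR    0.75    /* assumed override; default is 0.9 */
    /* ... SH_KEY, SH_HASH_KEY, SH_EQUAL, etc. as in the patch ... */
    #define SH_SCOPE         static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"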

>> > + /* rotate hashkey left 1 bit at each step */
>> > + hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
>> > + hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
>>
>> Why do you use such a strange rotation expression? I know compilers are
>> able to translate `h = (h << 1) | (h >> 31)` to a single rotate
>> instruction. Do they recognize the construction in your code as well?
>
> Not sure about all compilers, I only checked the earliest version of
> clang and gcc at godbolt.org and they both use a single "rol"
> instruction. https://godbolt.org/z/1GqdE6T3q

Yep, it looks like all compilers recognize that construction, with the single
exception of the old icc compiler (both 13.0.1 and 16.0.3):
https://godbolt.org/z/qsrjY5Eof
and all compilers recognize `(h << 1) | (h >> 31)` just fine.
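
For clarity, the two spellings being compared, written as standalone C
(nothing here beyond the expressions themselves comes from the patch):

    #include <stdint.h>

    /* Form used in the patch: test the top bit explicitly. */
    static inline uint32_t
    rotl1_ternary(uint32_t h)
    {
        return (h << 1) | ((h & 0x80000000) ? 1 : 0);
    }

    /* Conventional rotate idiom; compilers typically emit a single "rol". */
    static inline uint32_t
    rotl1_shift(uint32_t h)
    {
        return (h << 1) | (h >> 31);
    }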

>> Your construction looks more like a multiply-modulo operation in a 32-bit
>> Galois field. It is a widely used operation in cryptography, but it is
>> done modulo some primitive polynomial, and 0x100000001 is not such a
>> polynomial. 0x1000000c5 is, therefore it should be:
>>
>>     hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
>> or
>>     hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);
>
> That does not really make sense to me. If you're shifting a 32-bit
> variable right 31 places then why would you AND with 0xc5? The only
> possible result is 1 or 0 depending on if the most significant bit is
> on or off.

That is why there is a cast to signed int before shifting: `(int32)hashkey >> 31`.
The shift is then arithmetic (signed), so the result is either 0 or
0xffffffff.
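
To illustrate the sign-extension trick, here is a small sketch (variable and
function names are mine; it relies on arithmetic right shift of a negative
value, which is implementation-defined in ISO C but universal in practice):

    #include <stdint.h>

    /* Branchless "shift left by one, then conditionally mix in a constant":
     * the arithmetic right shift by 31 replicates the sign bit, producing a
     * mask of either 0x00000000 or 0xffffffff for the constant (0xc5 above). */
    static inline uint32_t
    shift_and_mix(uint32_t h)
    {
        uint32_t mask = (uint32_t) ((int32_t) h >> 31);  /* 0 or 0xffffffff */

        return (h << 1) | (mask & 0xc5);
    }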

>> But why not just use hash_combine(uint32 a, uint32 b) instead (defined
>> in hashfn.h)? Yes, it could be a bit slower, but is that critical?
>
> I had that function in the corner of my eye when writing this, but
> TBH, the hash function performance was just too big a factor to slow
> it down any further by using the more expensive hash_combine()
> function. I saw pretty good performance gains from writing my own hash
> function rather than using hash_bytes(). I didn't want to detract from
> that by using hash_combine(). Rotating the bits left 1 slot seems
> good enough for hash join and hash aggregate, so I don't have any
> reason to believe it's a bad way to combine the hash values. Do you?

Well, thinking about it a bit more, these hash values could be combined
using just addition: `hash(a) + hash(b) + hash(c)`.

I was thinking more about consistency in the codebase. But it looks like both
ways (`hash_combine(a,b)` and `rotl(a,1)^b`) are used in the code:
- hash_combine: used once (three lines) in hashTupleDesc in tupledesc.c
- rotl+xor: used six times:
-- three times (three lines) in execGrouping.c with a construction like yours
-- three times in jsonb_util.c, multirangetypes.c and rangetypes.c with
   `(h << 1) | (h >> 31)`.
Therefore I withdraw my recommendation on this point.

Looks like there is an opportunity for a micropatch to unify hash combining :-)
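
For reference, a rough sketch of the two idioms side by side (hash_combine is
paraphrased from memory and should be checked against hashfn.h; standalone
stdint types are used instead of PostgreSQL's uint32 typedef):

    #include <stdint.h>

    /* Boost-style combiner, roughly what hashfn.h's hash_combine does. */
    static inline uint32_t
    combine_boost(uint32_t a, uint32_t b)
    {
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
    }

    /* Rotate-left-by-one then xor, as in execGrouping.c and the patch. */
    static inline uint32_t
    combine_rotl(uint32_t a, uint32_t b)
    {
        return ((a << 1) | (a >> 31)) ^ b;
    }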

>
> If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
> 0x80000000) ? 1 : 0);", then you'll see where else we do the same
> rotate left trick.
>
>> > - * smgrclose() -- Close and delete an SMgrRelation object.
>> > + * smgrclose() -- Close and delete an SMgrRelation object but don't
>> > + * remove from the SMgrRelationHash table.
>>
>> I believe `smgrclose_internal()` should be in this comment.
>
> Oops. Yeah, that's a mistake.
>
>> Still, I don't believe it is worth separating smgrclose_internal from
>> smgrclose. Is there a measurable performance improvement from this
>> change? Even if there is, it will be smaller with SH_FILLFACTOR 0.75.
>
> The reason I did that is due to the fact that smgrcloseall() loops
> over the entire hash table and removes each entry one by one. The
> problem is that if I do a smgrtable_delete or smgrtable_delete_item in
> that loop then I'd need to restart the loop each time. Be aware that
> a simplehash delete can move entries earlier in the table, so it might
> cause us to miss entries during the loop. Restarting the loop each
> iteration is not going to be very efficient, so instead, I opted to
> make a version of smgrclose() that does not remove from the table so
> that I can just wipe out all table entries at the end of the loop. I
> called that smgrclose_internal().

If you read the comments in SH_START_ITERATE, you'll see:

* Search for the first empty element. As deletions during iterations are
* supported, we want to start/end at an element that cannot be affected
* by elements being shifted.

* Iterate backwards, that allows the current element to be deleted, even
* if there are backward shifts

Therefore, it is safe to delete during iteration, and it neither causes
nor requires a loop restart.
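
A hedged sketch of how smgrcloseall() could then delete entries during the
iteration itself; the smgrtable_* names, the SMgrRelationHash table and the
entry layout are taken from this thread's description of the patch, so treat
it as a sketch rather than the actual code:

    /*
     * Hypothetical: close every SMgrRelation, removing each hash entry as we
     * go.  simplehash iterates backwards precisely so that the current
     * element may be deleted even though deletion can shift other elements.
     */
    void
    smgrcloseall(void)
    {
        smgrtable_iterator iter;
        SMgrEntry  *entry;

        if (SMgrRelationHash == NULL)
            return;

        smgrtable_start_iterate(SMgrRelationHash, &iter);
        while ((entry = smgrtable_iterate(SMgrRelationHash, &iter)) != NULL)
            smgrclose(entry->data);    /* assumed to delete the entry too */
    }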

>
> An additional small benefit is that smgrclosenode() can get away with
> a single hashtable lookup rather than having to lookup the entry again
> with smgrtable_delete(). Using smgrtable_delete_item() deletes by
> bucket rather than key value which should be a good bit faster in many
> cases. I think the SH_ENTRY_CLEANUP macro is quite useful here as I
> don't need to worry about NULLing out the smgr_owner in yet another
> location where I do a hash delete.

I doubt it makes sense, since smgrclosenode is called only from
LocalExecuteInvalidationMessage, i.e. when another backend drops some
relation. There is no useful performance gain from it.

>
>> Likewise, I don't support modifying simplehash.h for SH_ENTRY_INITIALIZER,
>> SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
>> live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
>> doesn't mean much.
>
> Can you share what you've got in mind here?
>
> The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
> SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
> = key; to set the new entries key. Since I have SH_KEY defined as:
>
> #define SH_KEY data->smgr_rnode
>
> then I need some way to allocate the memory for ->data before the key
> is set. Doing that in smrgopen() is too late. We've already crashed by
> then for referencing uninitialised memory.

Oh, now I see.
I could suggest a workaround:
- use entry->hash as the whole key value and manually resolve hash collisions
  with chaining.
But that looks ugly: using a hash table and still resolving collisions by hand.

Therefore perhaps SH_ENTRY_INITIALIZER makes sense.

But SH_ENTRY_CLEANUP is abused in the patch: it is not symmetric to
SH_ENTRY_INITIALIZER, which smells bad. `smgr_owner` is better cleared the
way it is cleared now, in smgrclose, because that is less obscure. And
SH_ENTRY_CLEANUP should then be just `pfree(a->data)`.
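
A hedged sketch of the symmetric pair being suggested (the macro arguments,
the data field and the memory context are assumptions based on this thread,
not the actual patch):

    /* Hypothetical symmetric definitions: allocate the out-of-line
     * SMgrRelationData before simplehash assigns the key into it, and on
     * cleanup free only that allocation. */
    #define SH_ENTRY_INITIALIZER(a) \
        ((a)->data = MemoryContextAllocZero(TopMemoryContext, \
                                            sizeof(SMgrRelationData)))
    #define SH_ENTRY_CLEANUP(a) \
        pfree((a)->data)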

And there is still no reason to have SH_TRUNCATE.

> I did try putting the key in SMgrEntry but found the performance to be
> quite a bit worse than keeping the SMgrEntry down to 16 bytes. That
> makes sense to me as we only need to compare the key when we find an
> entry with the same hash value as the one we're looking for. There's a
> pretty high chance of that being the entry we want. If I got my hash
> function right then the odds are about 1 in 4 billion of it not being
> the one we want. The only additional price we pay when we get two
> entries with the same hash value is an additional pointer dereference
> and a key comparison.

That makes sense: the whole benefit of simplehash is cache locality, and
that is gained with a smaller entry.
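
For what it's worth, one plausible 16-byte entry layout implied by that
description (field names are assumptions; the actual patch may differ):

    /* Entry stays small for cache locality: the key (smgr_rnode) lives
     * out-of-line inside the SMgrRelationData that "data" points to. */
    typedef struct SMgrEntry
    {
        int          status;    /* simplehash entry status */
        uint32       hash;      /* full hash value, checked before the key */
        SMgrRelation data;      /* pointer to the actual SMgrRelationData */
    } SMgrEntry;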

regards,
Yura Sokolov
