Re: General purpose hashing func in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: General purpose hashing func in pgbench
Date: 2018-01-13 08:16:29
Message-ID: alpine.DEB.2.20.1801121614470.13422@lancre
Lists: pgsql-hackers


Hello Ildar,

>> so that different instances of hash function within one script would
>> have different seeds. Yes, that is a good idea, I can do that.
>>
> Added this feature in attached patch. But on a second thought this could
> be something that user won't expect. For example, they may want to run
> pgbench with two scripts:
> - the first one updates row by key that is a hashed random_zipfian value;
> - the second one reads row by key generated the same way
> (that is actually what YCSB workloads A and B do)
>
> It feels natural to write something like this:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> in both scripts and expect that they both would have the same
> distribution. But they wouldn't. We could of course describe this
> implicit behaviour in documentation, but ISTM that shared seed would be
> more clear.

I think that it depends on the use case, that both can be useful, so there
should be a way to do either.

With an "always different" default seed, distinct distributions are
achieved with:

-- DIFF distinct seeds inside and between runs
\set i1 abs(hash(:r1)) % 1000
\set j1 abs(hash(:r2)) % 1000

and the same distribution can be done with an explicit seed:

-- DIFF same seed inside and between runs
\set i1 abs(hash(:r1, 5432)) % 1000
\set j1 abs(hash(:r2, 5432)) % 1000

The drawback is that the same seed is used between runs in this case,
which is not desirable. This could be circumvented by adding the random
seed as an automatic variable and using it, eg:

-- DIFF same seed inside run, distinct between runs
\set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
\set j1 abs(hash(:r2, :random_seed + 5432)) % 1000

Now with a shared hash_seed, the same distribution is the default:

-- SHARED same underlying hash_seed inside run, distinct between runs
\set i1 abs(hash(:r1)) % 1000
\set j1 abs(hash(:r2)) % 1000

However, some trick is now needed to get distinct seeds. With

-- SHARED distinct seed inside run, but same between runs
\set i1 abs(hash(:r1, 5432)) % 1000
\set j1 abs(hash(:r2, 2345)) % 1000

we are back to the same issue as in the previous case, because the
distribution is then the same from one run to the next, which is not
desirable. I found this workaround trick:

-- SHARED distinct seeds inside and between runs
\set i1 abs(hash(:r1, hash(5432))) % 1000
\set j1 abs(hash(:r2, hash(2345))) % 1000

Or with a new :hash_seed or :random_seed automatic variable, we could also
have:

-- SHARED distinct seeds inside and between runs
\set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
\set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000

This provides controllable seeds that are distinct between runs, but that
can be made equal between expressions if desired, by reusing the same value
in the seed expression.
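
To make the SHARED cases above concrete, here is a rough Python sketch. It
is only an illustration under assumptions: seeded_hash is a stand-in built
on blake2b (not the function from the patch), hash_seed models the proposed
per-run shared seed, and the two seed values passed to run() are arbitrary.

```python
import hashlib

MASK64 = (1 << 64) - 1

def seeded_hash(value: int, seed: int) -> int:
    """Stand-in seeded hash returning a signed 64-bit integer.

    Assumption: this is NOT the hash function from the patch.
    """
    digest = hashlib.blake2b(
        (value & MASK64).to_bytes(8, "little"),
        digest_size=8,
        key=(seed & MASK64).to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little", signed=True)

def keys(values, seed):
    # Mirrors: \set key abs(hash(:r, seed)) % 1000
    return [abs(seeded_hash(v, seed)) % 1000 for v in values]

def run(hash_seed, values=range(20)):
    """One simulated run; hash_seed is the shared seed drawn for that run."""
    return {
        # hash(:r1) and hash(:r2) with the implicit shared seed
        "shared_i": keys(values, hash_seed),
        "shared_j": keys(values, hash_seed),
        # hash(:r1, 5432): explicit constant seed
        "fixed_i": keys(values, 5432),
        # hash(:r1, hash(5432)) and hash(:r2, hash(2345))
        "hashed_i": keys(values, seeded_hash(5432, hash_seed)),
        "hashed_j": keys(values, seeded_hash(2345, hash_seed)),
        # hash(:r1, :hash_seed + 5432) and hash(:r2, :hash_seed + 2345)
        "offset_i": keys(values, hash_seed + 5432),
        "offset_j": keys(values, hash_seed + 2345),
    }

run_a = run(hash_seed=1111)
run_b = run(hash_seed=2222)
```

In this sketch the "shared" mappings agree within a run and differ between
runs, the "fixed" mapping is identical in both runs, and the "hashed" and
"offset" variants differ both inside a run and from one run to the next.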

I also agree with your argument that the user may reasonably expect that
hash(5432) == hash(5432) inside and between scripts, at least within the
same run, and so would be surprised if that were not the case.

So I've changed my mind, and I'm sorry for making you go back and forth on
the subject. I'm now okay with one shared 64-bit hash seed, with clear
documentation of the fact, and an outline of the trick to achieve
distinct distributions inside a run if desired, and of why that would be
desirable to avoid correlations. Also, I think that providing the seed as
an automatic variable (:hash_seed or :hseed or whatever) would make some
sense as well. Maybe this could be used as a way to fix the seed
explicitly, e.g.:

pgbench -D hash_seed=1234 ...

This would use the given value instead of the randomly generated one. Also,
with that, the default inserted second argument could simply be
":hash_seed", which would simplify the executor, as it would not have to
check for an optional second argument.
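
As a rough illustration of that last point, here is a small Python sketch
of the idea. All names in it (add_default_seed, resolve, execute_hash) are
hypothetical and do not come from pgbench's source, and seeded_hash is
again only a stand-in for the real hash function.

```python
import hashlib

def seeded_hash(value: int, seed: int) -> int:
    # Stand-in hash (assumption: not the function from the patch).
    data = f"{value}:{seed}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little", signed=True)

def add_default_seed(args):
    """Parse-time rewrite: hash(x) becomes hash(x, :hash_seed)."""
    return args if len(args) == 2 else [args[0], ":hash_seed"]

def resolve(arg, variables):
    """Look up ':name' variable references; pass integer literals through."""
    if isinstance(arg, str) and arg.startswith(":"):
        return int(variables[arg[1:]])
    return arg

def execute_hash(args, variables):
    # The executor always receives exactly two arguments, so it needs
    # no special case for a missing seed.
    value, seed = (resolve(a, variables) for a in args)
    return seeded_hash(value, seed)

# As if the user had run: pgbench -D hash_seed=1234 ...
variables = {"hash_seed": "1234"}
implicit = execute_hash(add_default_seed([42]), variables)
explicit = execute_hash(add_default_seed([42, 1234]), variables)
```

Here implicit and explicit are equal, since the one-argument call is
rewritten to use the :hash_seed variable before it reaches the executor.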

--
Fabien.
