From: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: General purpose hashing func in pgbench |
Date: | 2018-01-16 14:49:23 |
Message-ID: | bca39e87-6d40-98f6-e4ef-e2d88e109f9c@postgrespro.ru |
Lists: | pgsql-hackers |
Hi Fabien,
On 13/01/2018 11:16, Fabien COELHO wrote:
>
> Hello Ildar,
>
>>> so that different instances of hash function within one script would
>>> have different seeds. Yes, that is a good idea, I can do that.
>>>
>> Added this feature in the attached patch. But on second thought, this could
>> be something that users won't expect. For example, they may want to run
>> pgbench with two scripts:
>> - the first one updates a row by a key that is a hashed random_zipfian
>> value;
>> - the second one reads a row by a key generated the same way
>> (that is actually what YCSB workloads A and B do)
>>
>> It feels natural to write something like this:
>> \set rnd random_zipfian(0, 1000000, 0.99)
>> \set key abs(hash(:rnd)) % 1000
>> in both scripts and expect that they would both have the same
>> distribution. But they wouldn't. We could of course describe this
>> implicit behaviour in the documentation, but ISTM that a shared seed would be
>> clearer.
>
> I think that it depends on the use case, that both can be useful, so
> there should be a way to do either.
>
> With an "always different" default seed, distinct distributions are achieved
> with:
>
> -- DIFF distinct seeds inside and between runs
> \set i1 abs(hash(:r1)) % 1000
> \set j1 abs(hash(:r2)) % 1000
>
> and the same distribution can be done with an explicit seed:
>
> -- DIFF same seed inside and between runs
> \set i1 abs(hash(:r1, 5432)) % 1000
> \set j1 abs(hash(:r2, 5432)) % 1000
>
> The drawback is that the same seed is used between runs in this case,
> which is not desirable. This could be circumvented by adding the
> random seed as an automatic variable and using it, eg:
>
> -- DIFF same seed inside run, distinct between runs
> \set i1 abs(hash(:r1, :random_seed + 5432)) % 1000
> \set j1 abs(hash(:r2, :random_seed + 5432)) % 1000
>
>
> Now, with a shared hash_seed, the same distribution is the default:
>
> -- SHARED same underlying hash_seed inside run, distinct between runs
> \set i1 abs(hash(:r1)) % 1000
> \set j1 abs(hash(:r2)) % 1000
>
> However some trick is needed now to get distinct seeds. With
>
> -- SHARED distinct seed inside run, but same between runs
> \set i1 abs(hash(:r1, 5432)) % 1000
> \set j1 abs(hash(:r2, 2345)) % 1000
>
> We are back to the same issue as in the previous case, because then the
> distribution is the same from one run to the next, which is not
> desirable. I found this workaround trick:
>
> -- SHARED distinct seeds inside and between runs
> \set i1 abs(hash(:r1, hash(5432))) % 1000
> \set j1 abs(hash(:r2, hash(2345))) % 1000
>
> Or with a new :hash_seed or :random_seed automatic variable, we could
> also have:
>
> -- SHARED distinct seeds inside and between runs
> \set i1 abs(hash(:r1, :hash_seed + 5432)) % 1000
> \set j1 abs(hash(:r2, :hash_seed + 2345)) % 1000
>
> This provides controllable seeds that are distinct between runs, but that
> can be made equal within a run if desired, by reusing the same value to be
> hashed as a seed.
>
> I also agree with your argument that the user may reasonably expect
> that hash(5432) == hash(5432) inside and between scripts, at least on
> the same run, so would be surprised that it is not the case.
>
> So I've changed my mind, and I'm sorry for making you go back and forth
> on the subject. I'm now okay with one shared 64-bit hash seed, with
> clear documentation of that fact, and an outline of the trick to
> achieve distinct distributions inside a run if desired and why it
> would be desirable to avoid correlations. Also, I think that providing
> the seed as automatic variable (:hash_seed or :hseed or whatever)
> would make some sense as well. Maybe this could be used as a way to
> fix the seed explicitly, e.g.:
>
> pgbench -D hash_seed=1234 ...
>
> This would use the given value instead of the randomly generated one. Also,
> with that, the default inserted second argument could simply be
> ":hash_seed", which would simplify the executor, since it would no longer
> have to check for an optional second argument.
>
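To make the trade-off discussed above concrete, here is a minimal Python sketch. It is not pgbench code: seeded_hash() is a stand-in FNV-1a variant whose seed handling (XOR into the offset basis) is my assumption, and key(), same_keys() and run_seed are hypothetical names used only for illustration. It shows that two scripts agree on the key for every input when they share a seed, while distinct per-call seeds give unrelated mappings.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3


def seeded_hash(value: int, seed: int) -> int:
    """FNV-1a over the 8 little-endian bytes of value.

    Assumption: the seed is XORed into the offset basis. This is only one
    possible way to seed the function, chosen here for illustration.
    """
    h = (FNV_OFFSET_BASIS ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h


def key(value: int, seed: int, buckets: int = 1000) -> int:
    # Counterpart of "abs(hash(:rnd)) % 1000". The hash here is unsigned,
    # so no abs() is needed in this sketch.
    return seeded_hash(value, seed) % buckets


def same_keys(seed_a: int, seed_b: int, values=range(1000)) -> bool:
    # True when two call sites map every input value to the same key.
    return all(key(v, seed_a) == key(v, seed_b) for v in values)


if __name__ == "__main__":
    run_seed = 0x5EED  # hypothetical per-run seed
    # Shared seed: the "update" and "read" scripts hit the same keys.
    print(same_keys(run_seed, run_seed))
    # Distinct seeds derived from the run seed: the mappings diverge.
    print(same_keys(run_seed + 5432, run_seed + 2345))
```

With a shared seed the first check always holds, which is the behaviour the YCSB-style two-script workload relies on.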
Here is a new version of the patch. I've split it into two parts. The
first one is almost the same as v4 from [1], with some refactoring. The
second part introduces the random_seed variable as you proposed. I haven't
done the executor simplification yet because I'm a little concerned
about inventive users who may want to change the random_seed variable at
runtime (which is possible, since pgbench doesn't have read-only
variables, aka constants, AFAIK).
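
A rough Python sketch of that concern follows. It is only a model, not the patch: the Session class, its method names and the toy seeded_hash() are all hypothetical. hash_checked() models an executor that checks for the optional argument and falls back to a seed fixed at startup; hash_rewritten() models the simplification where a missing argument is replaced by a lookup of the user-visible variable.

```python
MASK64 = (1 << 64) - 1


def seeded_hash(value: int, seed: int) -> int:
    # Toy FNV-1a style hash; the seeding scheme is an assumption.
    h = (0xCBF29CE484222325 ^ seed) & MASK64
    for byte in (value & MASK64).to_bytes(8, "little"):
        h = ((h ^ byte) * 0x100000001B3) & MASK64
    return h


class Session:
    """Hypothetical model of a client holding script variables."""

    def __init__(self, seed: int):
        self._startup_seed = seed                # fixed, not user-visible
        self.variables = {"random_seed": seed}   # writable, like any \set variable

    def hash_checked(self, value: int, seed=None) -> int:
        # Executor checks for the optional argument and uses the fixed seed.
        if seed is None:
            seed = self._startup_seed
        return seeded_hash(value, seed)

    def hash_rewritten(self, value: int, seed=None) -> int:
        # Missing argument is treated as ":random_seed", read at call time.
        if seed is None:
            seed = self.variables["random_seed"]
        return seeded_hash(value, seed)


if __name__ == "__main__":
    s = Session(1234)
    before = s.hash_rewritten(7)
    s.variables["random_seed"] = 99      # an "inventive user" overrides it
    print(s.hash_checked(7) == before)   # unaffected by the override
    print(s.hash_rewritten(7) == before) # now follows the new value
```

In this model the two variants agree until the variable is overwritten, after which only the rewritten form changes its results.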
[1] https://www.postgresql.org/message-id/43a8fbbb-32fa-6478-30a9-f64041adf019%40postgrespro.ru
--
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
001_pgbench_hash_v7.patch | text/plain | 10.6 KB |
002_pgbench_hash_seed.patch | text/plain | 4.3 KB |