| From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> | 
|---|---|
| To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> | 
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: General purpose hashing func in pgbench | 
| Date: | 2017-12-20 07:36:27 | 
| Message-ID: | alpine.DEB.2.20.1712200807550.5165@lancre | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hello Ildar,
> Following up the recent discussion on zipfian distribution I was trying
> to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB
> uses zipfian distribution to generate keys in order simulate intensive
> load on small number of records as it happens in real world applications
> (e.g. blogs). One problem is that most popular records keys are
> clustered together. To scatter them across the keyspace authors use
> hashing, the FNV-1a hash function in particular [2].
Yes, clustering is an issue for realistic load tests and may influence the 
resulting measured performance unduly.
> I've read Fabien Coelho's thread on additional operators and functions.
> Generally it could be possible to implement some basic hashing
> algorithms right in a pgbench script using just bitwise and arithmetic
> operators. But should we probably provide users with some general
> purpose hash function?
The problem I see with hash functions is that it generates collisions, 
which is an undesirable effect when dealing with keys as it biases the 
initial randomization.
Thus I intend to submit a parametric pseudo-random permutation function, 
some day. As I foresaw that it might require some arguing I did not 
include the function in the operators and functions collection I 
submitted, so as not to slow down the inclusion process. Given that the 
patch was submitted over 20 months ago and is still stuck in the CF, that 
was a misjudgement:-)
This is no no way a point against including one/some hash functions, 
especially if they actually appear in a benchmark.
> The attached patch introduces hash() function which implements FNV-1a as
> an example of such hashing algorithm. There are also couple of images in
> the attachement that I have got from visualizing original zipfian
> distribution and the hashed one. Usage example:
> In psql:
> create table abc as select generate_series(0, 999) as a, 0 as b;
>
> pgbench script:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> begin;
> update abc set b = b + 1 where a = :key;
> end;
>
> Any thoughts or suggestions?
As there may be several hash functions included in the long run. I'd 
suggest that the hash function should be named more precisely, eg 
"hash_fnv1a".
The image looks like the distribution is more regularly scattered than 
actually randomized... Maybe this is because the first highest 256 values 
are really scattered by the process multiply/modulo process. Or maybe this 
is an optical effect?
ISTM that there are undesired utf8 chars in a comment. Should be kept 
ASCII.
I would put the actual hash computation in a separate function rather than 
inlined in the evaluator.
Add the submission to the next CF?
-- 
Fabien.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Langote | 2017-12-20 08:27:20 | Re: [HACKERS] path toward faster partition pruning | 
| Previous Message | Masahiko Sawada | 2017-12-20 07:06:46 | Re: User defined data types in Logical Replication |