Re: General purpose hashing func in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: General purpose hashing func in pgbench
Date: 2017-12-20 07:36:27
Message-ID: alpine.DEB.2.20.1712200807550.5165@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Ildar,

> Following up the recent discussion on zipfian distribution I was trying
> to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB
> uses zipfian distribution to generate keys in order simulate intensive
> load on small number of records as it happens in real world applications
> (e.g. blogs). One problem is that most popular records keys are
> clustered together. To scatter them across the keyspace authors use
> hashing, the FNV-1a hash function in particular [2].

Yes, clustering is an issue for realistic load tests and may influence the
resulting measured performance unduly.

> I've read Fabien Coelho's thread on additional operators and functions.
> Generally it could be possible to implement some basic hashing
> algorithms right in a pgbench script using just bitwise and arithmetic
> operators. But should we probably provide users with some general
> purpose hash function?

The problem I see with hash functions is that it generates collisions,
which is an undesirable effect when dealing with keys as it biases the
initial randomization.

Thus I intend to submit a parametric pseudo-random permutation function,
some day. As I foresaw that it might require some arguing I did not
include the function in the operators and functions collection I
submitted, so as not to slow down the inclusion process. Given that the
patch was submitted over 20 months ago and is still stuck in the CF, that
was a misjudgement:-)

This is no no way a point against including one/some hash functions,
especially if they actually appear in a benchmark.

> The attached patch introduces hash() function which implements FNV-1a as
> an example of such hashing algorithm. There are also couple of images in
> the attachement that I have got from visualizing original zipfian
> distribution and the hashed one. Usage example:

> In psql:
> create table abc as select generate_series(0, 999) as a, 0 as b;
>
> pgbench script:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> begin;
> update abc set b = b + 1 where a = :key;
> end;
>
> Any thoughts or suggestions?

As there may be several hash functions included in the long run. I'd
suggest that the hash function should be named more precisely, eg
"hash_fnv1a".

The image looks like the distribution is more regularly scattered than
actually randomized... Maybe this is because the first highest 256 values
are really scattered by the process multiply/modulo process. Or maybe this
is an optical effect?

ISTM that there are undesired utf8 chars in a comment. Should be kept
ASCII.

I would put the actual hash computation in a separate function rather than
inlined in the evaluator.

Add the submission to the next CF?

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-12-20 08:27:20 Re: [HACKERS] path toward faster partition pruning
Previous Message Masahiko Sawada 2017-12-20 07:06:46 Re: User defined data types in Logical Replication