|From:||Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>|
|To:||Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>|
|Subject:||Re: General purpose hashing func in pgbench|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
> Following up the recent discussion on zipfian distribution I was trying
> to reproduce some YCSB-like workloads. As this paper  describes, YCSB
> uses zipfian distribution to generate keys in order simulate intensive
> load on small number of records as it happens in real world applications
> (e.g. blogs). One problem is that most popular records keys are
> clustered together. To scatter them across the keyspace authors use
> hashing, the FNV-1a hash function in particular .
Yes, clustering is an issue for realistic load tests and may influence the
resulting measured performance unduly.
> I've read Fabien Coelho's thread on additional operators and functions.
> Generally it could be possible to implement some basic hashing
> algorithms right in a pgbench script using just bitwise and arithmetic
> operators. But should we probably provide users with some general
> purpose hash function?
The problem I see with hash functions is that it generates collisions,
which is an undesirable effect when dealing with keys as it biases the
Thus I intend to submit a parametric pseudo-random permutation function,
some day. As I foresaw that it might require some arguing I did not
include the function in the operators and functions collection I
submitted, so as not to slow down the inclusion process. Given that the
patch was submitted over 20 months ago and is still stuck in the CF, that
was a misjudgement:-)
This is no no way a point against including one/some hash functions,
especially if they actually appear in a benchmark.
> The attached patch introduces hash() function which implements FNV-1a as
> an example of such hashing algorithm. There are also couple of images in
> the attachement that I have got from visualizing original zipfian
> distribution and the hashed one. Usage example:
> In psql:
> create table abc as select generate_series(0, 999) as a, 0 as b;
> pgbench script:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> update abc set b = b + 1 where a = :key;
> Any thoughts or suggestions?
As there may be several hash functions included in the long run. I'd
suggest that the hash function should be named more precisely, eg
The image looks like the distribution is more regularly scattered than
actually randomized... Maybe this is because the first highest 256 values
are really scattered by the process multiply/modulo process. Or maybe this
is an optical effect?
ISTM that there are undesired utf8 chars in a comment. Should be kept
I would put the actual hash computation in a separate function rather than
inlined in the evaluator.
Add the submission to the next CF?
|Next Message||Amit Langote||2017-12-20 08:27:20||Re: [HACKERS] path toward faster partition pruning|
|Previous Message||Masahiko Sawada||2017-12-20 07:06:46||Re: User defined data types in Logical Replication|