Re: General purpose hashing func in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: General purpose hashing func in pgbench
Date: 2018-01-17 07:52:55
Message-ID: alpine.DEB.2.20.1801170727240.11884@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Ildar,

> Here is a new version of patch. I've splitted it into two parts. The
> first one is almost the same as v4 from [1] with some refactoring. The
> second part introduces random_seed variable as you proposed.

Patch 1 applies. Compilations fails, there are two "hash_seed" declared in
"pgbench.c".

Patch 2 applies cleanly on top of the previous patch and compiles, because
the variable is removed...

If an ":hash_seed" pgbench variable is used, ISTM that there is no need
for a global variable at all, so the two patches are going back and forth,
which is unhelpful. ISTM better to provide just one combined patch for the
feature.

If the hash_seed variable really needs to be kept, it should be an "int64"
variable, like other pgbench values. Calling random just usually
initializes about 31 bits, so random should be called 2 or maybe 3 times?
Or maybe use the internal getrand which has 48 pseudorandom bits?

For me "random seed" is what is passed to srandom, so the variable should
rather be named hash_seed because there could also be a random seed
(actually, there is in another parallel patch:-). Moreover, this seed may
or may not be random if set, so calling it "random_seed" is not desirable.

The len < 1 || len > 2 is checked twice, once in the "switch", on in an
"if" just after the "switch". Once is enough.

> I didn't do the executor simplification thing yet because I'm a little
> concerned about inventive users, who may want to change random_seed
> variable in runtime (which is possible since pgbench doesn't have read
> only variables aka constants AFAIK).

If the user choses to overide hash_seed in their script, it is their
decision, the documentation has only to be clear about :hash_seed being
the default seed. I see no clear reason to work around this possibility by
evaluating the seed at parse time, especially as the variable may not have
its final value yet depending on the option order. I'd suggest to just use
make_variable("hash_seed") for the default second argument and simplify
the executor.

The seed variable is not tested explicitely in the script, you could add
a "hash(5432) == hash(5432, :hash_seed)" for instance.

It would be nice if an native English speaker could proofread the
documentation text. I'd suggest: "*an* optional seed parameter". "In case
*the* seed...". "<literal>:hash_seed</literal>". "shared for" -> "shared
by". "following listing" -> "following pgbench script". "few accounts
generates" -> "few accounts generate".

For the document example, I'd use larger values for the random & modulo,
eg 100000000 and 1000000. The drawback is that zipfian does a costly
computation of the generalized harmonic number when the parameter is lower
than 1.0. For cities, the parameter found by Zipf was 1.07 (says
Wikipedia). Maybe use this historical value. Or maybe use an exponential
distribution in the example.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marina Polyakova 2018-01-17 07:59:22 Re: WIP Patch: Precalculate stable functions, infrastructure v1
Previous Message Ashutosh Sharma 2018-01-17 07:34:01 Test-cases for exclusion constraints is missing in alter_table.sql file