From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Alik Khilazhev <a(dot)khilazhev(at)postgrespro(dot)ru>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Zipfian distribution in pgbench
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50%
> UPDATE of a random row by PK) when benchmarking with a big number of
> clients using a Zipfian distribution. MySQL also shows a decline, but not
> as significant as PostgreSQL's. MongoDB shows no decline at all. And if
> pgbench had a Zipfian-distribution random number generator, everyone would
> be able to do research on this topic without using YCSB.
Your description is not very precise. What version of Postgres is used? If
there is a decline, compared to which version? Is there a link to these
results?
> This is the reason why I am currently working on the random_zipfian
> function. The bottleneck of the algorithm that I use is that it computes
> the zeta function, which has linear complexity
> (https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause
> problems when generating a huge amount of big numbers.
Indeed, the function computation is overly expensive, and the numerical
precision of the implementation is doubtful.
If there is no better way to compute this function, ISTM that it should be
summed in reverse order so as to accumulate the small values first, from
(1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is exactly 1,
so there is no point in calling pow for it:

    double ans = 0.0;
    int    i;

    for (i = n; i >= 2; i--)
        ans += pow(1.0 / i, theta);
    return 1.0 + ans;
> That's why I added caching of the zeta value. And it works well for cases
> where random_zipfian is called with the same parameters in the script.
> That's why I have a question: should I implement support for caching zeta
> values for calls with different parameters, or not?
I do not envision the random_zipfian function being used widely, but if it
is useful to someone, that is fine with me. Could an inexpensive
exponential distribution be used instead for the same benchmarking purpose?

If the function is likely to be called with different parameters in actual
use, then some caching beyond the last value would seem in order. Maybe a
small fixed-size array?
However, it should somehow be thread safe, which does not seem to be the
case with the current implementation. Maybe a per-thread cache? Or use a
lock only to update a shared cache? At least it should avoid locking to
read cached values.
> P.S. I am attaching a patch and a script - an analogue of YCSB Workload A.
> Run the benchmark with the command:
> $ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql
Given the explanations, the random draw mostly hits values at the beginning
of the interval, so when the number of clients goes higher one just gets
locking contention on the updated rows?

ISTM that also having the tps achieved with a flat distribution would allow
checking this hypothesis.
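For comparison, flat-distribution analogues of the attached scripts could be
as simple as replacing the zipfian draw with a uniform one. The file, table,
and column names below are illustrative only (I have not inspected the
attached scripts), using the standard pgbench_accounts table and its
100000-rows-per-scale convention:

```sql
-- ycsb_read_flat.sql (illustrative name)
\set id random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :id;

-- ycsb_update_flat.sql (illustrative name)
\set id random(1, 100000 * :scale)
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :id;
```

Then run "pgbench -f ycsb_read_flat.sql -f ycsb_update_flat.sql" with the
same client counts: if tps does not collapse at high client counts under
the uniform draw, that would support the hot-row-contention explanation.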
> On scale = 10 (1 million rows) it gives the following results on a machine
> with 144 cores (with synchronous_commit = off):
>
>  nclients        tps
>         1       8842.401870
>         2      18358.140869
>         4      45999.378785
>         8      88713.743199
>        16     170166.998212
>        32     290069.221493
>        64     178128.030553
>       128      88712.825602
>       256      38364.937573
>       512      13512.765878
>      1000       6188.136736