Re: CPU costs of random_zipfian in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CPU costs of random_zipfian in pgbench
Date: 2019-02-17 08:51:59
Message-ID: alpine.DEB.2.21.1902170907110.8818@lancre
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers


Hello Tomas,

> I'm trying to use random_zipfian() for benchmarking of skewed data sets,
> and I ran head-first into an issue with rather excessive CPU costs.
> [...] This happens because generalizedHarmonicNumber() does this:
>
> for (i = n; i > 1; i--)
> ans += pow(i, -s);
>
> where n happens to be 1000000000 (range passed to random_zipfian), so
> the loop takes quite a bit of time.

If you find a better formula for the harmonic number, you are welcome
and probably get your name on it:-)

> So much that we only ever complete a
> single transaction, because this work happens in the context of the
> first transction, and so it counts into the 30-second limit.

That is why the value is cached, so it is done once per size & value.

If you want skewed but not especially zipfian, use exponential which is
quite cheap. Also zipfian with a > 1.0 parameter does not have to compute
the harmonic number, so it depends in the parameter.

Please also beware that non uniform keys are correlated (the more frequent
are close values), which is somewhat non representative of what you would
expect in real life. This is why I submitted a pseudo-random permutation
function, which alas does not get much momentum from committers.

> The docs actually do mention performance of this function:
>
> The function's performance is poor for parameter values close and
> above 1.0 and on a small range.
>
> But that does not seem to cover the example I just posted, because 0.1
> is not particularly close to 1.0 (or above 1.0), and 1e9 values hardly
> counts as "small range".

Yep. The warning is about the repeated cost of calling random_zipfian,
which is not good when the parameter is close to and above 1.0. It is not
about the setup cost when the value is between 0 and &. This could indeed
deserve a warning.

Now if you do benchmarking on a database, probably you want to run hours
of to level out checkpointing and other background tasks, so the setup
cost should be negligeable, in the end...

> I see this part of random_zipfian comes from "Quickly Generating
> Billion-Record Synthetic Databases" which deals with generating data
> sets, where wasting a bit of time is not a huge deal. But when running
> benchmarks it matters much more. So maybe there's a disconnect here?

Hmmm. The first author of this paper got a Turing award:-) The disconnect
is just that the setup cost is neglected, or computed offline.

> Interestingly enough, I don't see this issue on values above 1.0, no
> matter how close to 1.0 those are. Although the throughput seems lower
> than with uniform access, so this part of random_zipfian is not quite
> cheap either.

Indeed. Pg provides an underlying uniform pseudo-random function, so
generating uniform is cheap. Others need more or less expensive
transformations.

> Now, I don't know what to do about this. Ideally, we'd have a faster
> algorithm to generate zipfian distributions

You are welcome to find one, and get famous (hmmm... among some
specialized mathematicians at least:-) for it.

> - I don't know if such thing exists though. Or maybe we could precompute
> the distribution first, not counting it into the benchmark duration.

Could be done, but this would require significant partial evaluation
efforts and only work when the size and parameter are constants (although
using the function with variable parameters would be a very bad idea
anyway).

> But I guess the very least we can/should do is improving the docs to
> make it more obvious which cases are expected to be slow.

Yep. Attached is a doc & comment improvement.

--
Fabien.

Attachment Content-Type Size
pgbench-zipf-doc-1.patch text/x-diff 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2019-02-17 08:55:05 Re: amcheck verification for GiST
Previous Message Ramanarayana 2019-02-17 07:15:39 Re: BUG #15548: Unaccent does not remove combining diacritical characters