Re: CPU costs of random_zipfian in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Georgios Kokolatos <gkokolatos(at)pm(dot)me>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Fabien Coelho <postgresql(dot)org(at)coelho(dot)net>
Subject: Re: CPU costs of random_zipfian in pgbench
Date: 2019-03-24 14:34:33
Message-ID: alpine.DEB.2.21.1903241513570.9939@lancre
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers


Hello Tomas,

>>> What would a user do with this information, and how would they know
>>> what to do?
>>
>> Sure, but it was unclear what to do. Extending the cache to avoid that
>> would look like over-engineering.
>
> That seems like a rather strange argument. What exactly is so complex on
> resizing the cache to quality as over-engineering?

Because the cache size can diverge on a bad script, and the search is
currently linear. Even with a log search it does not look attractive.

> If the choice is between reporting the failure to the user, and
> addressing the failure, surely the latter would be the default option?
> Particularly if the user can't really address the issue easily
> (recompiling psql is not very practical solution).
>
>>> I remain of the opinion that we ought to simply rip out support for
>>> zipfian with s < 1.
>
> +1 to that

If this is done, some people with zipfian distribution that currently
work might be unhappy.

>> Some people really want zipfian because it reflects their data access
>> pattern, possibly with s < 1.
>>
>> We cannot helpt it: real life seems zipfianly distributed:-)
>
> Sure. But that hardly means we ought to provide algorithms that we know
> are not suitable for benchmarking tools, which I'd argue is this case.

Hmmm. It really depends on the values actually used.

> Also, we have two algorithms for generating zipfian distributions. Why
> wouldn't it be sufficient to keep just one of them?

Because the two methods work for different values of the parameter, so it
depends on the distribution which is targetted. If you want s < 1, the
exclusion method does not help (precisely, the real "infinite" zipfian
distribution is not mathematical defined when s < 1 because the underlying
sum diverges. Having s < 1 can only be done on a finite set).

>>> It's not useful for benchmarking purposes to have a random-number
>>> function with such poor computational properties.
>>
>> This is mostly a startup cost, the generation cost when a bench is
>> running is reasonable. How to best implement the precomputation is an
>> open question.
>
> ... which means it's not a startup cost. IMHO this simply shows pgbench
> does not have the necessary infrastructure to provide this feature.

Well, a quick one has been proposed with a cache. Now I can imagine
alternatives, but I'm wondering how much it is worth it.

Eg, restraint random_zipfian to more or less constant parameters when s <
1, so that a partial evaluation could collect the needed pairs and perform
the pre-computations before the bench is started.

Now, that looks like over-engineering, and then someone would complain of
the restrictions.

>> As a reviewer I was not thrilled by the cache stuff, but I had no better
>> idea that would not fall under "over-over engineering" or the like.
>>
>> Maybe it could error out and say "recompile me", but then someone
>> would have said "that is unacceptable".
>>
>> Maybe it could auto extend the cache, but that is still more unnecessary
>> over-engineering, IMHO.
>
> I'm puzzled. Why would that be over-engineering?

As stated above, cache size divergence and O(n) search. Even a log2(n)
search would be bad, IMO.

>>> I think leaving it in there is just a foot-gun: we'd be a lot better
>>> off throwing an error that tells people to use some other distribution.
>>
>> When s < 1, the startup cost is indeed a pain. However, it is a pain
>> prescribed by a Turing Award.
>
> Then we shouldn't have it, probably. Or we should at least implement a
> proper startup phase, so that the costly precomputation is not included
> in the test interval.

That can be done, but I'm not sure of an very elegant design. And I was
just the reviewer on this one.

>>> Or if we do leave it in there, we for sure have to have documentation
>>> that *actually* explains how to use it, which this patch still doesn't.
>>
>> I'm not sure what explaining there could be about how to use it: one
>> calls the function to obtain pseudo-random integers with the desired
>> distribution?
>
> Well, I'd argue the current description "performance is poor" is not
> particularly clear.
>
>>> There's nothing suggesting that you'd better not use a large number of
>>> different (n,s) combinations.
>>
>> Indeed, there is no caveat about this point, as noted above.
>>> Please find an updated patch for the documentation, pointing out the
>> existence of the cache and an advice not to overdo it.
>>> It does not solve the underlying problem which raised your complaint,
>> but at least it is documented.
>>
>
> I think you forgot to attache the patch ...

Indeed, this is becoming a habbit:-( Attached. Hopefully.

--
Fabien.

Attachment Content-Type Size
pgbench-zipf-doc-2.patch text/x-diff 3.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2019-03-24 14:41:10 Re: chained transactions
Previous Message Fabien COELHO 2019-03-24 14:12:58 Re: CPU costs of random_zipfian in pgbench