From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Georgios Kokolatos <gkokolatos(at)pm(dot)me>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Fabien Coelho <postgresql(dot)org(at)coelho(dot)net>
Subject: Re: CPU costs of random_zipfian in pgbench
>>> What would a user do with this information, and how would they know
>>> what to do?
>> Sure, but it was unclear what to do. Extending the cache to avoid that
>> would look like over-engineering.
> That seems like a rather strange argument. What exactly is so complex
> about resizing the cache that it would qualify as over-engineering?
Because the cache size can grow without bound on a bad script, and the
search is currently linear. Even with a logarithmic search it would not
look attractive.
> If the choice is between reporting the failure to the user, and
> addressing the failure, surely the latter would be the default option?
> Particularly if the user can't really address the issue easily
> (recompiling pgbench is not a very practical solution).
>>> I remain of the opinion that we ought to simply rip out support for
>>> zipfian with s < 1.
> +1 to that
If this is done, some people whose zipfian distributions currently work
might be unhappy.
>> Some people really want zipfian because it reflects their data access
>> pattern, possibly with s < 1.
>> We cannot help it: real life seems zipfianly distributed :-)
> Sure. But that hardly means we ought to provide algorithms that we know
> are not suitable for benchmarking tools, which I'd argue is this case.
Hmmm. It really depends on the values actually used.
> Also, we have two algorithms for generating zipfian distributions. Why
> wouldn't it be sufficient to keep just one of them?
Because the two methods work for different values of the parameter: the
choice depends on the distribution that is targeted. If you want s < 1,
the exclusion method does not help (precisely, the true "infinite"
zipfian distribution is not mathematically defined when s <= 1 because
the underlying sum diverges; having s < 1 can only be done on a finite
set).
>>> It's not useful for benchmarking purposes to have a random-number
>>> function with such poor computational properties.
>> This is mostly a startup cost: the generation cost while a bench is
>> running is reasonable. How to best implement the precomputation is an
>> open question.
> ... which means it's not a startup cost. IMHO this simply shows pgbench
> does not have the necessary infrastructure to provide this feature.
Well, a quick solution has been proposed with a cache. Now I can imagine
alternatives, but I'm wondering how much it is worth it.
E.g., restrict random_zipfian to more-or-less-constant parameters when
s < 1, so that a partial evaluation could collect the needed (n, s)
pairs and perform the precomputations before the bench is started.
Now, that looks like over-engineering, and then someone would complain
about that as well.
>> As a reviewer I was not thrilled by the cache stuff, but I had no better
>> idea that would not fall under "over-over engineering" or the like.
>> Maybe it could error out and say "recompile me", but then someone
>> would have said "that is unacceptable".
>> Maybe it could auto extend the cache, but that is still more unnecessary
>> over-engineering, IMHO.
> I'm puzzled. Why would that be over-engineering?
As stated above: the cache size can diverge, and the search is O(n).
Even an O(log n) search would be bad, IMO.
>>> I think leaving it in there is just a foot-gun: we'd be a lot better
>>> off throwing an error that tells people to use some other distribution.
>> When s < 1, the startup cost is indeed a pain. However, it is a pain
>> prescribed by a Turing Award winner.
> Then we shouldn't have it, probably. Or we should at least implement a
> proper startup phase, so that the costly precomputation is not included
> in the test interval.
That can be done, but I'm not sure of a very elegant design. And I was
just the reviewer on this one.
>>> Or if we do leave it in there, we for sure have to have documentation
>>> that *actually* explains how to use it, which this patch still doesn't.
>> I'm not sure what explaining there could be about how to use it: one
>> calls the function to obtain pseudo-random integers with the desired
>> distribution.
> Well, I'd argue the current description "performance is poor" is not
> particularly clear.
>>> There's nothing suggesting that you'd better not use a large number of
>>> different (n,s) combinations.
>> Indeed, there is no caveat about this point, as noted above.
>> Please find an updated patch for the documentation, pointing out the
>> existence of the cache and advice not to overdo it.
>> It does not solve the underlying problem which raised your complaint,
>> but at least it is documented.
> I think you forgot to attach the patch ...
Indeed, this is becoming a habit :-( Attached. Hopefully.