Re: pgbench gaussian/exponential docs improvements

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgbench gaussian/exponential docs improvements
Date: 2015-10-26 06:29:27
Message-ID: alpine.DEB.2.10.1510260711240.24734@sto
Lists: pgsql-hackers


>> I was not only thinking of mathematical figures, I was also thinking of
>> graphics, some format may be zip containing XML stuff for instance.
>
> But we don't need it here, so why should we care about it too much?

I was just digressing from the main subject :-) Having some graphics in
the doc would help here and there, though.

> I do understand that. I'm trying to explain that "threshold" is in fact
> completely disconnected from min and max, as the transformation scales the
> data to [-1,1] like this
>
> 2.0 * (i - min - mu + 0.5) / (max - min + 1)
>
> and only then the 'threshold' coefficient is applied. And if I read the
> Box-Muller transformation correctly, it generates data with standard Normal
> distribution from [-threshold, threshold] and then transforms them to the
> right mean etc.

Yep, the threshold parameter is designed to be somewhat independent of the
actual [min, max] range.
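
To illustrate, here is a minimal sketch of a truncated Box-Muller draw mapped
onto [lo, hi]. It is an illustration under assumptions, not the pgbench source:
the function name gaussian_rand, its parameters, and the final clamp are
hypothetical. The point is that threshold only acts on the standard normal
value z, before the range is applied.

```python
import math
import random


def gaussian_rand(rng, lo, hi, threshold):
    """Draw an integer in [lo, hi] from a Gaussian truncated at +/- threshold.

    Hypothetical helper for illustration; not the actual pgbench code.
    """
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
        u2 = rng.random()
        # Box-Muller: turn two uniform draws into one standard normal draw.
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Reject draws outside [-threshold, threshold).
        if -threshold <= z < threshold:
            break
    # Rescale z from [-threshold, threshold) to [0, 1).
    frac = (z + threshold) / (2.0 * threshold)
    # Map onto the integer range; min() guards against floating-point rounding.
    return min(hi, lo + int((hi - lo + 1) * frac))
```

With the same threshold, gaussian_rand(rng, 1, 100, 4.0) and
gaussian_rand(rng, 1, 1000000, 4.0) concentrate the same share of draws around
the middle of their respective ranges.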

> But maybe that's what the first sentence is trying to say? I mean this:
>
> For a Gaussian distribution, the interval is mapped onto a standard
> normal distribution (the classical bell-shaped Gaussian curve)
> truncated at -threshold on the left and +threshold on the right.

Yep, that looks like it.

> I'm asking about this because it wasn't immediately clear to me whether I
> need to tweak this for data sets with different scales, but apparently not.

Indeed, this is the idea of how the parameter is used.

> After reading the docs again I think that's also clear from last sentence
> that relates threshold and 67% and 95%.

Yep.
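
A quick numeric check of those figures, as a sketch: assuming the draw is a
standard normal truncated to [-threshold, threshold], the share of draws
within k standard deviations of the mean is
(2 PHI(k) - 1) / (2 PHI(threshold) - 1). The function names below are
illustrative only.

```python
import math


def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def share_within(k, threshold):
    """Share of a normal truncated at +/- threshold that lies within +/- k."""
    return (2.0 * phi(k) - 1.0) / (2.0 * phi(threshold) - 1.0)


# For threshold = 4.0, k = 1 covers the middle 1/4 of the interval and
# k = 2 covers the middle 2/4.
print(round(share_within(1.0, 4.0), 3))
print(round(share_within(2.0, 4.0), 3))
```

For threshold = 4.0 this gives roughly 0.68 and 0.95, and the result does not
depend on min or max.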

> Anyway, the references to "standard normal distribution" are a bit sloppy -
> "standard" usually means normal distribution with exactly mu=0 and sigma=1.
> So it's a bit strange to say
>
> standard normal distribution, with mean mu defined as (max+min)/2.0
>
> because that's not a standard normal distribution at all. I propose to fix
> this by removing the "standard".

Hmmm, probably fine if it is both more precise and shorter!

> [...]
> CDF2(x) = PHI(2.0 * threshold * ...) / (2.0 * PHI(threshold) - 1.0)
>
> and then the probability of "i" is
>
> P(X=i) = CDF2(i+0.5) - CDF2(i-0.5)

I agree that defining the shifted/scaled CDF and using it afterwards looks
cleaner.
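
As a sketch of that formulation: the code below assumes the argument elided
as "..." above is (x - mu) / (max - min + 1) with mu = (max + min) / 2.0, which
is the form I believe the pgbench documentation uses. The names phi, cdf2 and
prob are illustrative.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI in the formulas above)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf2(x, lo, hi, threshold):
    """Shifted and scaled CDF; the scaled argument is an assumption, see above."""
    mu = (lo + hi) / 2.0
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def prob(i, lo, hi, threshold):
    """P(X = i) = CDF2(i + 0.5) - CDF2(i - 0.5)."""
    return cdf2(i + 0.5, lo, hi, threshold) - cdf2(i - 0.5, lo, hi, threshold)


# The probabilities over the whole range should sum to 1 (up to rounding).
total = sum(prob(i, 1, 100, 4.0) for i in range(1, 101))
print(total)
```

Under this assumption the terms telescope, so the sum over [min, max] is
exactly (PHI(threshold) - PHI(-threshold)) / (2 PHI(threshold) - 1) = 1.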

> Which is what I meant by simplifying the equation. Not that it'd make it
> easier to imagine the shape, though ...

Sure. This is the part about providing the "precise" information: what the
actual probability of drawing i is, depending on the parameters.

> Maybe. Another thing is that "middle quarter" and "middle half" seems a bit
> strange - if you split data into 1/4s there's no middle one (sure, I
> understand what the sentence is meant to say).

Improvements are welcome!

>> Ok. I think that the fact that it relies on the Box-Muller transform is
>> relevant, because there are other methods to generate a gaussian
>> distribution, and I would say that there is no reason to have to go to
>> the source code to check that. But I would not provide further details.
>> So I'm fine with the current status.
>
> There are alternative methods for almost every non-trivial piece of code, and
> we generally don't mention that in user docs. Why should we mention it in
> this case? Why would the user care which particular PRNG was used to generate
> the numbers? Maybe there really is a reason for that, I don't know.

If it were about security, you would care: when an algorithm has just been
announced to be broken, you want to know whether you depend on it.

As a scientist, I like it when fellow scientists who achieved useful
things have their names cited :-).

--
Fabien.
