Re: pgbench gaussian/exponential docs improvements

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgbench gaussian/exponential docs improvements
Date: 2015-10-25 21:01:37
Message-ID: alpine.DEB.2.10.1510252141040.24734@sto
Lists: pgsql-hackers


> [...]
>
> So either the information is important and then should be placed in the
> docs directly, or it's not and then linking to wikipedia is pointless
> because the users are not interested in learning all the details about
> each distribution function.

What matters is that these distributions can be used from pgbench. What a
gaussian or an exponential distribution actually is does *not* matter as
such.

For me it is not the job of the PostgreSQL documentation to explain
probability theory, but just to provide *precise* information about what is
actually available, for someone who is interested, without having to read
the source code. At least that is the idea behind the current documentation.

>>> Firstly, it'd be nice if we could add some figures illustrating the
>>> distributions - much better than explaining the shapes in text. I
>>> don't know if we include figures in the existing docs (probably not),
>>> but generating the figures is rather simple.
>>
>> There are basically no figures in the documentation. Too bad, but it is
>> understandable: what should be the format (svg, jpg, png, ...), should
>> it be generated (gnuplot, others), what is the impact on the
>> documentation build (html, epub, pdf, ...), how portable should it be,
>> what about compressed formats vs git diffs?
>>
>> Once you start asking these questions you understand why there are no
>> figures:-)
>
> I don't see why diffs would be a problem.

I was not only thinking of mathematical figures, I was also thinking of
graphics in general: some formats may be a zip archive containing XML, for
instance, which does not diff well.

>>> Probably nitpicking, but left/right of what? I assume the normal
>>> distribution is placed at 0, so it's left/right of zero.
>>
>> No, it is around the middle of the interval.
>
> You mean [min,max] interval?

Yep.

> I believe the transformation
>
> 2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)
>
> essentially moves the mean into 0, scales the data to [0,1] and then applies
> the threshold.

Probably:-) I wrote that some time ago, and it is 10 pm for me:-).

> In other words, the general shape of the curve will be exactly the same no
> matter the actual min/max (except that for longer intervals the values will
> be lower, as there are more possible values).
>
> I don't really see how it's related to this?
>
> [(max-min)/2 - threshold, (max-min)/2 + threshold]

The gaussian distribution is defined over the reals, but here it is used
for integers, so the real values are projected onto integers. The function
should compute the probability of drawing a given integer "i" in the
interval: that is, given min, max and threshold, what is the probability of
drawing i.
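
To make that concrete, here is a rough Python sketch of the computation,
not taken from the pgbench source. It assumes that mu is the middle of the
[min, max] interval, (min + max) / 2, which may not match the convention
used in the expression quoted above, and that the result is renormalized
because the gaussian is cut off at +/- threshold. The names phi,
gaussian_int_prob, lo and hi are mine, for illustration only.

```python
import math


def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_int_prob(i, lo, hi, threshold):
    """Probability of drawing integer i in [lo, hi] (illustrative sketch)."""
    mu = (lo + hi) / 2.0   # assumption: centre of the interval
    width = hi - lo + 1    # number of integers in the interval

    def f(x):
        # Rescale so that lo - 0.5 maps to -threshold and hi + 0.5 to +threshold.
        return phi(2.0 * threshold * (x - mu) / width)

    # Mass of the real-valued slot [i - 0.5, i + 0.5], divided by the total
    # mass that survives the cut at +/- threshold.
    return (f(i + 0.5) - f(i - 0.5)) / (f(hi + 0.5) - f(lo - 0.5))


probs = [gaussian_int_prob(i, 1, 10, 4.0) for i in range(1, 11)]
```

With these assumptions the probabilities over 1..10 sum to one, are
symmetric around the middle of the interval, and peak at 5 and 6.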

>>> Could we simplify the equation a bit? It's needlessly difficult to
>>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>>> good to first define the CDF and then just use that.
>>
>> ISTM that PHI is *the* normal CDF, which is more or less available as
>> such in various environments (matlab, python, excel...). Well, why not
>> define the particular CDF and use it. Not sure the text would be that
>> much lighter, though.
>
> PHI is the CDF of the normal distribution, not the modified probability
> distribution here (with threshold and scaled to the desired interval).

Yep, that is exactly what I was saying, I think.

>>> This seems broken - too many sentences about the 67% and 95%.
>>
>> The point is to provide rules of thumb to describe how the distribution
>> is shaped. Any better sentence is welcome.
>
> Ah, I misread the sentence initially. I hadn't realized it speaks about
> 1/threshold in the first part, and the second part is an example for
> threshold=4.0. So I thought it was a repetition of the first part.

Maybe it needs spacing and colons and rewording, if it is too hard to parse.
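
For what it is worth, the rule of thumb can be checked numerically. The
sketch below is only an illustration under my own assumptions: it treats
the distribution as a continuous gaussian truncated at +/- threshold
standard deviations and ignores the rounding to integers. The function
name middle_fraction is hypothetical.

```python
import math


def middle_fraction(k, threshold):
    """Share of draws in the middle k/threshold of the interval.

    The interval spans [-threshold, +threshold] standard deviations, so
    its middle k/threshold part is [-k, +k]. The division renormalizes
    for the tails cut off beyond the threshold.
    """
    return math.erf(k / math.sqrt(2.0)) / math.erf(threshold / math.sqrt(2.0))


threshold = 4.0
quarter = middle_fraction(1.0, threshold)  # middle 1/4 of the interval
half = middle_fraction(2.0, threshold)     # middle 2/4 of the interval
print(f"middle quarter: {quarter:.1%}, middle half: {half:.1%}")
```

For threshold=4.0 this gives roughly two thirds of the draws in the middle
quarter of the interval and about 95% in the middle half.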

>>> Does it make sense to explicitly mention the implementation detail
>>> (Box-Muller transform) here?
>
> No, my point was exactly the opposite - removing the mention of Box-Muller
> entirely, not adding more details about it.

Ok. I think that the fact that it relies on the Box-Muller transform is
relevant, because there are other methods to generate a gaussian
distribution, and I would say that there is no reason to have to go to the
source code to check that. But I would not provide further details. So I'm
fine with the current status.
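
To show what the mention implies, here is a minimal Python sketch of
drawing an integer with the Box-Muller transform and a threshold. It is
written from memory as an illustration, not copied from pgbench; the
rejection loop, the mapping onto [lo, hi] and the names gaussian_rand, lo,
hi and rng are my assumptions.

```python
import math
import random


def gaussian_rand(lo, hi, threshold, rng=random):
    """Draw an integer in [lo, hi] from a truncated gaussian (sketch)."""
    while True:
        # 1 - random() lies in (0, 1], so log() never sees 0.
        u1 = 1.0 - rng.random()
        u2 = 1.0 - rng.random()
        # Box-Muller: two uniform deviates give one standard normal deviate.
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # Reject draws outside [-threshold, threshold) and try again.
        if -threshold <= z < threshold:
            break
    # Map [-threshold, threshold) onto [0, 1), then onto the integers.
    r = (z + threshold) / (2.0 * threshold)
    # min() guards against floating-point rounding pushing r up to 1.0.
    return min(hi, lo + int((hi - lo + 1) * r))


rng = random.Random(0)
sample = [gaussian_rand(1, 100, 4.0, rng) for _ in range(2000)]
```

Other generators (inverse CDF, ziggurat, ...) would give the same
distribution but a different sequence for a given seed, which is one reason
to name the method.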

--
Fabien.
