Re: pgbench - add pseudo-random permutation function

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Hironobu SUZUKI <hironobu(at)interdb(dot)jp>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: pgbench - add pseudo-random permutation function
Date: 2020-02-01 09:07:13
Message-ID: alpine.DEB.2.21.2002010916510.20752@pseudo
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Peter,

>>> This patch was marked as RFC on 2019-03-30, but since then there have
>>> been a couple more issues pointed out in a review by Thomas Munro, and
>>> it went through 2019-09 and 2019-11 without any attention. Is the RFC
>>> status still appropriate?
>>
>> Thomas review was about comments/documentation wording and asking for
>> explanations, which I think I addressed, and the code did not actually
>> change, so I'm not sure that the "needs review" is really needed, but do
>> as you feel.
>
> I read the whole thread, I still don't know what this patch is supposed to
> do. I know what the words in the subject line mean, but I don't know how
> this helps a pgbench user run better benchmarks. I feel this is also the
> sentiment expressed by others earlier in the thread. You indicated that this
> functionality makes sense to those who want this functionality, but so far
> only two people, namely the patch author and the reviewer, have participated
> in the discussion on the substance of this patch. So either the feature is
> extremely niche, or nobody understands it. I think you ought to take about
> three steps back and explain this in more basic terms, even just in email at
> first so that we can then discuss what to put into the documentation.

Here is the motivation for this feature:

When you design a benchmark to test performance, you want to avoid
unwanted correlation which may impact performance unduly, one way or
another. For the default pgbench benchmarks, accounts are chosen
uniformly, no problem. However, if you want to test a non uniform
distribution, which is what most people would encounter in practice,
things are different.

For instance, you would replace random by random_exp in the default
benchmarks. If you do that, then all accounts with lower ids would come
out more often. However this creates an unwanted correlation and
performance effect: why frequent accounts would just happen to be those
with small ids? which just happen to reside together in the first few
pages of the table? In order to avoid these effects, you need to shuffle
the chosen account ids, so that frequent accounts are not specifically
those at the beginning of the table.

Basically, as soon as you have a non uniform random generator selecting
step you want to add some shuffle, otherwise you are going to skew your
benchmark measures. Nobody should use a non-uniform generator for
selecting rows without some shuffling, ever. I have no doubt that many
people do without realizing that they are skewing their performance data.

Once you realize that, you can try to invent your own shuffling, but
frankly this is not as easy as it looks to have something non trivial
which would not generate unwanted correlation. I had a lot of (very) bad
design before coming up with the one in the patch: You want something
fast, you want good steering, which are contradictory objectives. There is
also the question of being able to change the shuffling for a given size,
for instance to check that there is no unwanted effects, hence the
seeding. Good seeded shuffling is what an encryption algorithm do, but for
these it is somehow simpler because they would usually work on power of
two sizes.

In conclusion, using a seeded shuffle step is needed as soon as you start
using non random generators. Providing one in pgbench, a tool designed to
run possibly non-uniform benchmarks against postgres, looks like common
courtesy. Not providing one is helping people to design bad benchmarks,
possibly without realizing it, so is outright thoughtlessness.

I have no doubt that the documentation should be improved to explain that
in a concise but clear way.

--
Fabien.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dent John 2020-02-01 09:09:32 Re: The flinfo->fn_extra question, from me this time.
Previous Message Pierre Ducroquet 2020-02-01 07:51:04 PATCH: add support for IN and @> in functional-dependency statistics use