Re: add modulo (%) operator to pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: add modulo (%) operator to pgbench
Date: 2014-09-24 07:45:22
Message-ID: alpine.DEB.2.10.1409240916340.3672@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Heikki,

>> If you reject it, you can also remove the gaussian and exponential random
>> distribution which is near useless without a mean to add a minimal
>> pseudo-random stage, for which "(x * something) % size" is a reasonable
>> approximation, hence the modulo submission.
>
> I'm confused. The gaussian and exponential patch was already committed.

Yes.

They are significant patches that really involved significant work, and
which are only really useful with a complementary stupid 10 lines patch
which is being rejected without understanding why it is needed.

> Are you saying that it's not actually useful, and should be reverted?
> That doesn't make sense to me, gaussian and exponential distributed
> values seems quite useful to me in test scripts.

Yes and no.

Currently these distributions are achieved by mapping a continuous
function onto integers, so that neighboring integers get neighboring
number of draws, say with size=7:

#draws 10 6 3 1 0 0 0 // some exponential distribution
int drawn 0 1 2 3 4 5 6

Although having an exponential distribution of accesses on tuples is quite
reasonable, the likelyhood there would be so much correlation between
neighboring values is not realistic at all. You need some additional
shuffling to get there.

> I don't understand what that pseudo-random stage you're talking about is. Can
> you elaborate?

The pseudo random stage is just a way to scatter the values. A basic
approach to achieve this is "i' = (i * large-prime) % size", if you have a
modulo. For instance with prime=5 you may get something like:

#draws 10 6 3 1 0 0 0
int drawn 0 1 2 3 4 5 6 (i)
scattered 0 5 3 1 6 4 2 (i' = 5 i % 7)

So the distribution becomes:

#draws 10 1 0 3 0 6 0
scattered 0 1 2 3 4 5 6

Which is more interesting from a testing perspective because it removes
the neighboring value correlation.

A better approach is to use a hash function. "i' = hash(i) % size",
although it skews the distribution some more, but the quality of the
shuffling is much better than with the mult-modulo version above.
Note that you need a modulo as well...

I must say that I'm appaled by a decision process which leads to such
results, with significant patches passed, and the tiny complement to make
it really useful (I mean not on the paper or on the feature list, but in
real life) is rejected...

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2014-09-24 08:27:46 Re: pgcrypto: PGP signatures
Previous Message Heikki Linnakangas 2014-09-24 06:58:46 Re: BRIN indexes - TRAP: BadArgument