Re: add modulo (%) operator to pgbench

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: add modulo (%) operator to pgbench
Date: 2014-09-24 10:28:06
Message-ID: 54229CB6.5010608@vmware.com
Lists: pgsql-hackers

On 09/24/2014 10:45 AM, Fabien COELHO wrote:
> Currently these distributions are achieved by mapping a continuous
> function onto integers, so that neighboring integers get neighboring
> numbers of draws, say with size=7:
>
> #draws 10 6 3 1 0 0 0 // some exponential distribution
> int drawn 0 1 2 3 4 5 6
>
> Although having an exponential distribution of accesses on tuples is quite
> reasonable, the likelihood that there would be so much correlation between
> neighboring values is not realistic at all. You need some additional
> shuffling to get there.
>
>> I don't understand what that pseudo-random stage you're talking about is. Can
>> you elaborate?
>
> The pseudo random stage is just a way to scatter the values. A basic
> approach to achieve this is "i' = (i * large-prime) % size", if you have a
> modulo. For instance with prime=5 you may get something like:
>
> #draws 10 6 3 1 0 0 0
> int drawn 0 1 2 3 4 5 6 (i)
> scattered 0 5 3 1 6 4 2 (i' = 5 i % 7)
>
> So the distribution becomes:
>
> #draws 10 1 0 3 0 6 0
> scattered 0 1 2 3 4 5 6
>
> Which is more interesting from a testing perspective because it removes
> the neighboring value correlation.
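The scatter step quoted above can be sketched in a few lines of Python (purely for illustration; pgbench itself would do this arithmetic in a \set line once a modulo operator exists), using the same draw counts and prime from the example:

```python
# Scatter a skewed distribution with i' = (i * prime) % size.
# Because the prime is coprime to size, the map i -> (i * prime) % size
# is a permutation of 0..size-1, which breaks the correlation between
# neighboring keys while preserving the overall draw distribution.
size = 7
prime = 5
draws = [10, 6, 3, 1, 0, 0, 0]  # exponential-ish draw counts per key

scattered = [0] * size
for i, n in enumerate(draws):
    scattered[(i * prime) % size] = n

print(scattered)  # [10, 1, 0, 3, 0, 6, 0] -- matches the table above
```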

Depends on what you're testing. Yeah, shuffling like that makes sense
for a primary key. Or not: very often, recently inserted rows are also
queried more often, so there is indeed a strong correlation between the
integer key and the access frequency. Or imagine that you have a table
that stores the height of people in centimeters. To populate that, you
would want to use a gaussian-distributed variable, without shuffling.

For shuffling, perhaps we should provide a pgbench function or operator
that does that directly, instead of having to implement it using * and
%. Something like hash(x, min, max), where x is the input variable
(gaussian distributed, or whatever you want), and min and max are the
range to map it to.
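A rough sketch of what such a hash(x, min, max) could look like, again in Python for illustration. The name hash_range and the specific mixing function (a splitmix64-style 64-bit finalizer) are assumptions, not anything pgbench provides; the point is only that a decent integer hash followed by a reduction into [min, max] gives the scattering without the user writing * and % by hand:

```python
# Hypothetical hash(x, min, max): hash the input integer, then map the
# result into the inclusive range [lo, hi]. The mixing constants below
# are the splitmix64 finalizer; any good 64-bit integer hash would do.
def hash_range(x, lo, hi):
    h = x & 0xFFFFFFFFFFFFFFFF
    h = ((h ^ (h >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    h = ((h ^ (h >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 31
    return lo + h % (hi - lo + 1)
```

With this, a gaussian- or exponentially-distributed x still covers the whole target range, but neighboring x values land far apart, which is exactly the decorrelation the modulo trick is after.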

> I must say that I'm appalled by a decision process which leads to such
> results, in which significant patches are passed, yet the tiny complement
> that would make them really useful (I mean not on paper or on the feature
> list, but in real life) is rejected...

The idea of a modulo operator was not rejected, we'd just like to have
the infrastructure in place first.

- Heikki
