Re: Obtaining random rows from a result set

From: "Josh Tolley" <eggyknap(at)gmail(dot)com>
To: "Alban Hertroys" <alban(at)magproductions(dot)nl>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Obtaining random rows from a result set
Date: 2007-08-31 13:33:36
Message-ID: e7e0a2570708310633k561eb423saf3953c45687365b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 8/31/07, Alban Hertroys <alban(at)magproductions(dot)nl> wrote:
> Hello,
>
> I've recently been busy improving a query that yields a fixed number of
> random records matching certain conditions. I have tried all the usual
> approaches, and although they do work, they're all limited in some way
> and don't translate really well to what you "want". They're kludges, IMHO.
>
> The methods I've tried are explained quite well on
> http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html
>
> All these methods involve calculating a random number for every record
> in the result set at some point in time, which is really not what I'm
> trying to model. I think the database should provide some means to get
> those records, so...
>
> Dear Santa,
>
> I'd like my database to have functionality analogue to how LIMIT works,
> but for other - non-sequential - algorithms.
>
> I was thinking along the lines of:
>
> SELECT *
> FROM table
> WHERE condition = true
> RANDOM 5;
>
> Which would (up to) return 5 random rows from the result set, just as
> LIMIT 5 returns (up to) the first 5 records in the result set.
>
>
> Or maybe even with a custom function, so that you could get non-linear
> distributions:
>
> SELECT *
> FROM table
> WHERE condition = true
> LIMIT 5 USING my_func();
>
> Where my_func() could be a user definable function accepting a number
> that should be (an estimate of?) the number of results being returned so
> that it can provide pointers to which rows in the resultset will be
> returned from the query.
>
> Examples:
> * random(maxrows) would return random rows from the resultset.
> * median() would return the rows in the middle of the result set (this
> would require ordering to be meaningful).
>
> What do people think, is this feasable? Desirable? Necessary?
>
> If I'd have time I'd volunteer for at least looking into this, but I'm
> working on three projects simultaneously already. Alas...
>
> Regards,
> Alban Hertroys.
>
> --
> Alban Hertroys
> alban(at)magproductions(dot)nl
>
> magproductions b.v.
>
> T: ++31(0)534346874
> F: ++31(0)534346876
> M:
> I: www.magproductions.nl
> A: Postbus 416
> 7500 AK Enschede
>
> // Integrate Your World //
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

It seems to me that anything that wants to return a random set of rows
will need to calculate a random number for all the rows it processes,
unless you change how the database scans rows in indexes or tables,
which if at all possible will probably make things *really* slow. If
it's a given that the database will always sequentially scan whatever
it is the query plan tells it to scan, you're pretty much stuck with
the rows in your result set being in the same order unless you start
picking random numbers.

One possible alternative not mentioned on the site you linked to is to
as follows:

select [whatever] from [table] where random() < [some number between 0
and 1] limit [limit value]

That doesn't require assigning a random number for *every* row in the
table, nor does it require sorting everything. It does mean that
numbers encountered earlier in the query processing have a higher
likelihood of being returned, and it also means that there's some
chance you won't actually get as many as [limit value] rows returned.

jtolley=# create table a (i integer);
CREATE TABLE
jtolley=# insert into a (i) select * from generate_series(1, 100);
INSERT 0 100
jtolley=# create table a (i integer);
CREATE TABLE
jtolley=# insert into a (i) select * from generate_series(1, 100);
INSERT 0 100
jtolley=# select * from a where random() < .1 limit 3;
i
----
22
23
25
(3 rows)

Hope this helps...

-Josh

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Kaloyan Iliev 2007-08-31 13:34:48 Re: Obtaining random rows from a result set
Previous Message Josh Tolley 2007-08-31 13:17:36 Re: URGENT: Whole DB down ("no space left on device")