Re: Selecting K random rows - efficiently!

From: "John D(dot) Burger" <john(at)mitre(dot)org>
To: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Selecting K random rows - efficiently!
Date: 2007-10-26 04:01:38
Message-ID: 6134F4E4-6FA3-46E9-A1EE-6DBA015E60EF@mitre.org
Lists: pgsql-general

As far as I can tell, all of the proposed solutions lack sample
independence. Take the OP's suggested approach of doing something
like this:

SELECT * FROM mydata
WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
ORDER BY mydata.random_number ASC LIMIT 100

All you're doing is picking random =subsequences= from the same
permutation of the original data. This is not the same as a random
sample. That is, if rows A and B are adjacent in the permutation,
then if A is in the sample, B will also be in it with very high
probability, depending on the size of the sample. Another way of
saying this is that only the first element of the sample is selected
randomly; the rest are completely deterministic. In a true random
sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true
random samples, because it makes a new permutation every time. If
you want to use the random_number column approach, then you need to
do the same. You can accomplish this by sampling from the original
permutation repeatedly, doing the above with LIMIT 1 as many times as
you need. Yes, this is more costly, but TANSTAAFL.
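Concretely, a sketch of the two approaches, assuming the same mydata
table with a precomputed random_number column as in the query above:

```sql
-- True random sample: a fresh permutation on every query,
-- at the cost of sorting the whole table.
SELECT * FROM mydata ORDER BY RANDOM() LIMIT 100;

-- Independent draws from the precomputed random_number column:
-- run this single-row pick 100 times (e.g. from application code),
-- so each element is selected independently of the others.
SELECT * FROM mydata
WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
ORDER BY mydata.random_number ASC
LIMIT 1;
```

Note that repeated independent draws sample with replacement -- the
same row can come back more than once -- so if you need 100 distinct
rows, discard duplicates on the client side and draw again.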

As is often observed, it's easy to create the appearance of
randomness, harder to accomplish in reality.

- John Burger
MITRE
