Skip site navigation (1) Skip section navigation (2)

Re: Selecting K random rows - efficiently!

From: "vincent" <vinny(at)xs4all(dot)nl>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Selecting K random rows - efficiently!
Date: 2007-10-30 09:08:03
Message-ID: 24973.84.107.170.210.1193735283.squirrel@webmail.xs4all.nl (view raw or flat)
Thread:
Lists: pgsql-general
> 2007/10/26, Patrick TJ McPhee <ptjm(at)interlog(dot)com>:
>> In article <ffnid8$1q2t$1(at)news(dot)hub(dot)org>, cluster  <skrald(at)amossen(dot)dk>
>> wrote:
>> % > How important is true randomness?
>> %
>> % The goal is an even distribution but currently I have not seen any way
>> % to produce any kind of random sampling efficiently. Notice the word
>>
>> How about generating the ctid randomly? You can get the number of pages
>> from pg_class and estimate the number of rows either using the number
>> of tuples in pg_class or just based on what you know about the data.
>> Then just generate two series of random numbers, one from 0 to the
>> number
>> of pages and the other from 1 to the number of rows per page, and keep
>> picking rows until you have enough numbers. Assuming there aren't too
>> many dead tuples and your estimates are good, this should retrieve n
>> rows
>> with roughly n look-ups.
>>
>> If your estimates are low, there will be tuples which can never be
>> selected,
>> and so far as I know, there's no way to construct a random ctid in a
>> stock
>> postgres database, but apart from that it seems like a good plan. If
>> efficiency is important, you could create a C function which returns a
>> series of random tids and join on that.
>> --
>>
>
> SELECT id, ...
>    FROM data
>   WHERE id = ANY(ARRAY(
>                        SELECT (random()*max_id)::int
>                           FROM generate_series(1,20)))
>   LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between
zero and the maximum id. That's not truely random, nor is it completely
safe if there are lots of gaps in the values of id. For example, if the
lowest id  is 50000 and the highest is 50040, this will be very likely to
generate lots of numbers below 50000 and find no records at all.


Responses

pgsql-general by date

Next:From: Alexis BeuraudDate: 2007-10-30 10:07:45
Subject: Checking empty array
Previous:From: Pavel StehuleDate: 2007-10-30 08:52:21
Subject: Re: Selecting K random rows - efficiently!

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group