Re: Selecting K random rows - efficiently!

From: "vincent" <vinny(at)xs4all(dot)nl>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Selecting K random rows - efficiently!
Date: 2007-10-30 09:08:03
Message-ID: 24973.84.107.170.210.1193735283.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> 2007/10/26, Patrick TJ McPhee <ptjm(at)interlog(dot)com>:
>> In article <ffnid8$1q2t$1(at)news(dot)hub(dot)org>, cluster <skrald(at)amossen(dot)dk>
>> wrote:
>> % > How important is true randomness?
>> %
>> % The goal is an even distribution but currently I have not seen any way
>> % to produce any kind of random sampling efficiently. Notice the word
>>
>> How about generating the ctid randomly? You can get the number of pages
>> from pg_class and estimate the number of rows either using the number
>> of tuples in pg_class or just based on what you know about the data.
>> Then just generate two series of random numbers, one from 0 to the
>> number
>> of pages and the other from 1 to the number of rows per page, and keep
>> picking rows until you have enough numbers. Assuming there aren't too
>> many dead tuples and your estimates are good, this should retrieve n
>> rows
>> with roughly n look-ups.
>>
>> If your estimates are low, there will be tuples which can never be
>> selected,
>> and so far as I know, there's no way to construct a random ctid in a
>> stock
>> postgres database, but apart from that it seems like a good plan. If
>> efficiency is important, you could create a C function which returns a
>> series of random tids and join on that.
>> --
>>
>
> SELECT id, ...
> FROM data
> WHERE id = ANY(ARRAY(
> SELECT (random()*max_id)::int
> FROM generate_series(1,20)))
> LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between
zero and the maximum id. That's not truely random, nor is it completely
safe if there are lots of gaps in the values of id. For example, if the
lowest id is 50000 and the highest is 50040, this will be very likely to
generate lots of numbers below 50000 and find no records at all.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alexis Beuraud 2007-10-30 10:07:45 Checking empty array
Previous Message Pavel Stehule 2007-10-30 08:52:21 Re: Selecting K random rows - efficiently!