Re: funny view/temp table problem with query

From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: funny view/temp table problem with query
Date: 2009-02-26 09:46:30
Message-ID: 596DB518-E3A8-4613-AB03-34E0981A2ED9@solfertje.student.utwente.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Feb 25, 2009, at 11:02 AM, Grzegorz Jaśkiewicz wrote:

> So I have a 'accounts' table, with id and name, and than some
> hypothetical 'packages' table, containing some info per customer.
>
> I need to retrive distinct pairs , of random packages assigned per
> customer.
> Packages table contains 10 packages, id:=[1:10], there's 1M customers
> for testing purposes.
>
> I could name the tables foo/bar again, but decided for something more
> creative this time ;)
>
> Anyways, I have this query:
>
>
> select count(distinct (v,id)) from (
> select heh.id, v[i] from
> (
> SELECT ss.id, ARRAY
> (
> SELECT id FROM packages where ss.id>0 and id between 2 and 6
> ORDER BY random() limit 5
> ) as v FROM
> (
> SELECT id FROM accounts ORDER BY random() limit 100000
> ) ss
> ) heh,generate_series(1, 5 ) i order by heh.id,v
> ) ziew;

An alternative solution is to NOT order by random and not to limit,
but to use a scrollable cursor. Having to order your entire result set
by random is a fairly expensive operation and you only want 5 random
rows anyway, not 100000, so it is an inefficient approach as well: In
a good solution you should be calculating random() 5 times, not 100000.

Normal cursors just pick the next row from the result set as you
request them. Scrollable ones allow you to pick specific rows from
that result set. As soon as you know how many rows you have, picking 5
random ones isn't that hard. The idea is to calculate 5 random row
numbers from your result set and retrieve only those rows.

To do this you'll first need to know how many rows there are. That can
be determined by scrolling to the last row and reading the instruction
result (not the record itself) of that instruction; it contains a row
number (mind the one-off difference with a row count).

That row number you can feed to the random() function so it returns
numbers from 1..(lastRow+1). Scroll to that row and read the result,
repeat as often as you like (5 times in your case).

You may have realised that there is a chance to get duplicates here if
you happen to calculate the same random row number more than once.
That's not very hard to fix of course, you only need to keep track of
which row numbers you already used and recalculate the random number
if it's already in your set.

You can put the code to do this in your application (if your
connection interface allows for scrollable cursors), or since pg8.3
you can create a stored procedure to do this. I believe before 8.3
scrollable cursors weren't usable in pl/pgsql. Then again, maybe other
pl-languages are more suitable for a general solution... Back when I
had this problem I was using PHP and pg8.1, putting the code in a
function in my application worked fine, but it felt like it didn't
belong there.

The general opinion seems to be that picking random rows isn't a
relational operation, and for that reason a relational database isn't
particularly good at that. I think my approach works as well as it
does because it's a procedural approach to a procedural problem.

If you'd like to see some code, I have posted about this in the past
and that contained some code examples. Just search the archives.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,49a664fc129742059914308!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Daniel Verite 2009-02-26 09:50:18 Format string for ISO-8601 date and time
Previous Message Craig Ringer 2009-02-26 04:24:13 Re: Can I use a query with UPDATE on its SET?