Re: Gsoc2012 Idea --- Social Network database schema

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>,"daniel" <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-22 16:38:17
Message-ID: 4F6B0F2902000025000465DA@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not. Maybe you could do it
>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>> like to sample the table without reading all of it first, so that
>> seems to miss the point.
>
> I think actually the traditional locution is more like
> WHERE random() < constant
> where the constant is the fraction of the table you want. And
> yeah, the presumption is that you'd like it to not actually read
> every row. (Though unless the sampling density is quite a bit
> less than 1 row per page, it's not clear how much you're really
> going to win.)

It's all going to depend on the use cases, which I don't think I've
heard described very well yet.

I've had to pick random rows from, for example, a table of
disbursements to support a financial audit. In those cases it has
been the sample size that mattered, and order didn't. One
interesting twist there is that for some of these financial audits
they wanted the probability of a row being selected to be
proportional to the dollar amount of the disbursement. I don't
think you can do this without a first pass across the whole data
set.

I've also been involved in developing software to pick random people
for jury selection processes. In some of these cases, you don't
want a certain percentage, you want a particular number of people,
and you want the list to be ordered so that an initial set of
potential jurors can be seated from the top of the list and then as
the voir dire[1] process progresses you can replace excused jurors
from progressive positions on the randomized list.

In both cases you need to be able to explain the technique used in
lay terms and show why it is very hard for anyone to predict or
control which rows are chosen, but also use statistical analysis to
prove that there is no significant correlation between the rows
chosen and identifiable characteristics of the rows. For example,
selecting all the rows from random blocks would be right out for
juror selection because a list from the state DOT of people with
driver's licenses and state ID cards would probably be in license
number order when loaded, and since the start of the driver's
license number is a soundex of the last name, there is a strong
correlation between blocks of the table and ethnicity.

One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the table, and generate a series of random ctid values to
read. If the tuple doesn't exist or is not visible to the snapshot,
you ignore it and continue, until you have read the requisite number
of rows. You could try to generate them in advance and sort them by
block number, but then you need to solve the problems of what to do
if that set of ctids yields too many rows or too few rows, both of
which have sticky issues.

-Kevin

[1] http://en.wikipedia.org/wiki/Voir_dire#Use_in_the_United_States

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christopher Browne 2012-03-22 17:17:01 Re: Gsoc2012 Idea --- Social Network database schema
Previous Message Tom Lane 2012-03-22 16:10:15 Re: Trivial libpq refactoring patch (was: Re: Another review of URI for libpq, v7 submission)