Re: Gsoc2012 Idea --- Social Network database schema

From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-22 17:17:01
Message-ID: CAFNqd5UaDOZFoa_bk6n0fo-LvVsWaQmWbzdYx+LaJikh6WykSw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not.  Maybe you could do it
>>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>>> like to sample the table without reading all of it first, so that
>>> seems to miss the point.
>>
>> I think actually the traditional locution is more like
>>       WHERE random() < constant
>> where the constant is the fraction of the table you want.  And
>> yeah, the presumption is that you'd like it to not actually read
>> every row.  (Though unless the sampling density is quite a bit
>> less than 1 row per page, it's not clear how much you're really
>> going to win.)
>
> It's all going to depend on the use cases, which I don't think I've
> heard described very well yet.
>
> I've had to pick random rows from, for example, a table of
> disbursements to support a financial audit.  In those cases it has
> been the sample size that mattered, and order didn't.  One
> interesting twist there is that for some of these financial audits
> they wanted the probability of a row being selected to be
> proportional to the dollar amount of the disbursement.  I don't
> think you can do this without a first pass across the whole data
> set.
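
For concreteness, the traditional locution Tom mentions might look
like this (a minimal sketch; the table name is hypothetical):

  -- Draw a roughly 1% sample.  Every row still gets visited, which
  -- is why you don't win much unless the sampling density is well
  -- under one row per page.
  SELECT * FROM disbursements WHERE random() < 0.01;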

Kevin's proportional-to-amount case was commonly called "Dollar Unit
Sampling," though the terminology has gradually been internationalized
to "Monetary Unit Sampling":
http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html

What the article doesn't mention is that some particularly large items
might wind up covering multiple samples. In the example, they're
looking for a sample every $3125 down the list. If there were a single
transaction valued at $30000, it would (roughly) cover 10 of the
desired samples, since $30000 spans nearly ten $3125 intervals.
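
A running-total query expresses that directly. Here's a minimal
sketch (table and column names are hypothetical); sample_hits shows
how many of the $3125 sampling points each selected row covers:

  -- Monetary unit sampling at a fixed $3125 interval: a row is
  -- selected when its span of dollars crosses one or more multiples
  -- of the interval, so a $30000 row covers roughly 10 of them.
  WITH cum AS (
    SELECT id, amount,
           SUM(amount) OVER (ORDER BY id) AS running_total
    FROM disbursements
  )
  SELECT id, amount,
         floor(running_total / 3125.0)
           - floor((running_total - amount) / 3125.0) AS sample_hits
  FROM cum
  WHERE floor(running_total / 3125.0)
        > floor((running_total - amount) / 3125.0);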

It isn't possible to do this without scanning across the entire table:
each row's selection depends on the running total of everything that
came before it.

If you want repeatability, you probably want to materialize enough
information to reconstruct the ordering that was chosen. That's
probably something that needs to be captured as part of the work of
the audit, so not only does it need to involve a pass across the data,
it probably requires capturing a fair bit of data for posterity.
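
One cheap way to capture that might be to materialize the ordering and
running totals the sample was drawn from (again, a sketch with
hypothetical names):

  -- Snapshot the basis of the audit sample so it can be re-derived
  -- later, long after the underlying table has changed.
  CREATE TABLE audit_sample_basis AS
    SELECT id, amount,
           SUM(amount) OVER (ORDER BY id) AS running_total
    FROM disbursements;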
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
