Re: Gsoc2012 Idea --- Social Network database schema

From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-24 04:20:26
Message-ID: BAY159-W491A758F09F0D6D98966A3A3470@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbrowne(at)gmail(dot)com
> To: Kevin(dot)Grittner(at)wicourts(dot)gov
> CC: pgsql-hackers(at)postgresql(dot)org
>
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not. Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >> WHERE random() < constant
> >> where the constant is the fraction of the table you want. And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row. (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit. In those cases it has
> > been the sample size that mattered, and order didn't. One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement. I don't
> > think you can do this without a first pass across the whole data
> > set.
>
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
>
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $30000, that (roughly) covers 10 of the desired
> samples.
>
> It isn't possible to do this without scanning across the entire table.
>
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"

The discussion till now has gone far beyond my understanding.....Could anyone explain briefly what is the idea for now? The designing detail for me is still unfamiliar. I can only take time to understand while possible after being selected and put time on it to read relevant material. For now, I'm still curious why Neil's implementation is no longer working? The Postgres has been patched a lot, but the general idea behind Neil's implementation should still work, isn't it? Besides, whether this query is needed is still not decided. Seems this is another hard to decide point. Is it that this topic is still not so prepared for the Gsoc yet? If really so, I think I still have time to switch to other topics. Any suggestion?
Thanks.

Best Regards and ThanksHuang Qi VictorComputer Science of National University of Singapore

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Boszormenyi Zoltan 2012-03-24 09:49:07 Re: ECPG FETCH readahead
Previous Message Dan McGee 2012-03-24 04:17:18 [PATCH] Never convert n_distinct < 2 values to a ratio when computing stats