Re: Gsoc2012 Idea --- Social Network database schema

From: Joshua Berkus <josh(at)agliodbs(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-24 20:12:50
Message-ID: 612260444.406269.1332619970287.JavaMail.root@mail-1.01.com
Lists: pgsql-hackers

Qi,

Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.

----- Original Message -----
>
> > Date: Thu, 22 Mar 2012 13:17:01 -0400
> > Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database
> > schema
> > From: cbbrowne(at)gmail(dot)com
> > To: Kevin(dot)Grittner(at)wicourts(dot)gov
> > CC: pgsql-hackers(at)postgresql(dot)org
> >
> > On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> > <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > >>> Well, the standard syntax apparently aims to reduce the number
> > >>> of
> > >>> returned rows, which ORDER BY does not. Maybe you could do it
> > >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> > >>> like to sample the table without reading all of it first, so
> > >>> that
> > >>> seems to miss the point.
> > >>
> > >> I think actually the traditional locution is more like
> > >> WHERE random() < constant
> > >> where the constant is the fraction of the table you want. And
> > >> yeah, the presumption is that you'd like it to not actually read
> > >> every row. (Though unless the sampling density is quite a bit
> > >> less than 1 row per page, it's not clear how much you're really
> > >> going to win.)
> > >
> > > It's all going to depend on the use cases, which I don't think
> > > I've
> > > heard described very well yet.
> > >
> > > I've had to pick random rows from, for example, a table of
> > > disbursements to support a financial audit. In those cases it has
> > > been the sample size that mattered, and order didn't. One
> > > interesting twist there is that for some of these financial
> > > audits
> > > they wanted the probability of a row being selected to be
> > > proportional to the dollar amount of the disbursement. I don't
> > > think you can do this without a first pass across the whole data
> > > set.
> >
> > This one was commonly called "Dollar Unit Sampling," though the
> > terminology has gradually gotten internationalized.
> > http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> >
> > What the article doesn't mention is that some particularly large
> > items
> > might wind up covering multiple samples. In the example, they're
> > looking for a sample every $3125 down the list. If there was a
> > single
> > transaction valued at $30000, that (roughly) covers 10 of the
> > desired
> > samples.
> >
> > It isn't possible to do this without scanning across the entire
> > table.
> >
> > If you want repeatability, you probably want to instantiate a copy
> > of
> > enough information to indicate the ordering chosen. That's probably
> > something that needs to be captured as part of the work of the
> > audit,
> > so not only does it need to involve a pass across the data, it
> > probably requires capturing a fair bit of data for posterity.
> > --
> > When confronted by a difficult problem, solve it by reducing it to
> > the
> > question, "How would the Lone Ranger handle this?"
>
> The discussion till now has gone far beyond my understanding.....
> Could anyone explain briefly what is the idea for now?
> The designing detail for me is still unfamiliar. I can only take time
> to understand while possible after being selected and put time on it
> to read relevant material.
> For now, I'm still curious why Neil's implementation is no longer
> working? The Postgres has been patched a lot, but the general idea
> behind Neil's implementation should still work, isn't it?
> Besides, whether this query is needed is still not decided. Seems
> this is another hard to decide point. Is it that this topic is still
> not so prepared for the Gsoc yet? If really so, I think I still
> have time to switch to other topics. Any suggestion?
>
>
> Thanks.
>
> Best Regards and Thanks
> Huang Qi Victor
> Computer Science of National University of Singapore
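For reference, the "WHERE random() < constant" locution quoted above can be sketched outside the database as follows. This is only an illustration in Python, not anything from Neil's patch; the function name bernoulli_sample and its parameters are made up for the example. Note that it still visits every row, which is exactly the cost the thread is worried about.

```python
import random


def bernoulli_sample(rows, fraction, rng=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper that mimics `WHERE random() < fraction`.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = rng or random.Random()
    # Every row is examined: the filter is evaluated per row, so this
    # corresponds to a full scan, not a scan that skips pages.
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    # The sample size varies from run to run; only its expectation is fixed.
    sample = bernoulli_sample(range(1000), 0.1)
    print(len(sample))
```

The result has roughly fraction * len(rows) rows, but the exact count is random, which is why this form does not suit audits that need a fixed sample size.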

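The dollar unit sampling scheme discussed in the quoted thread can be sketched roughly as below. This is a minimal sketch under stated assumptions: amounts are non-negative, the selection points are spaced a fixed interval apart from a start offset, and the name monetary_unit_sample and its parameters are hypothetical. It shows both points made above: the running total forces a pass over all rows, and one large item can cover several selection points.

```python
import random


def monetary_unit_sample(amounts, interval, start=None, rng=None):
    """Return {row_index: hit_count} for systematic dollar unit sampling.

    Hypothetical helper; assumes every amount is non-negative.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    rng = rng or random.Random()
    if start is None:
        # Random offset in [0, interval) for the first selection point.
        start = rng.random() * interval
    hits = {}
    next_point = start
    running_total = 0
    # The cumulative total depends on every earlier row, so the whole
    # data set is scanned in a fixed order.
    for index, amount in enumerate(amounts):
        running_total += amount
        # A large item may contain several selection points.
        while next_point < running_total:
            hits[index] = hits.get(index, 0) + 1
            next_point += interval
    return hits


if __name__ == "__main__":
    # A $30000 item with a $3125 interval absorbs 9 or 10 selection points,
    # depending on where the start offset falls.
    print(monetary_unit_sample([1000, 30000, 500, 2000], 3125, start=1500))
```

For repeatability, the start offset and the row ordering would both have to be recorded along with the audit, since either one changes which rows are hit.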