Re: Gsoc2012 Idea --- Social Network database schema

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:59:49
Message-ID: 4F69FAF5.8000809@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/21/2012 11:49 AM, Robert Haas wrote:
> On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not. Maybe you could do it with
>>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>>> sample the table without reading all of it first, so that seems to
>>> miss the point.
>> I think actually the traditional locution is more like
>> WHERE random()< constant
>> where the constant is the fraction of the table you want. And yeah,
>> the presumption is that you'd like it to not actually read every row.
>> (Though unless the sampling density is quite a bit less than 1 row
>> per page, it's not clear how much you're really going to win.)
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".
>
> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.
>
> rhaas=# explain select * from randomtext where ctid>= '(500,1)' and
> ctid< '(501,1)';
> QUERY PLAN
> --------------------------------------------------------------------
> Seq Scan on randomtext (cost=0.00..111764.90 rows=25000 width=31)
> Filter: ((ctid>= '(500,1)'::tid) AND (ctid< '(501,1)'::tid))
> (2 rows)
>
> The last time this came up for me was when I was trying to find which
> row in a large table as making the SELECT blow up; but it seems like
> it could be used to implement a poor man's sampling method, too... it
> would be nicer, in either case, to be able to specify the block
> numbers you'd like to be able to read, rather than bounding the CTID
> from both ends as in the above example.

That would rapidly get unmanageable when you wanted lots of pages.

Maybe we could do something like a pagenum pseudovar, or a wildcard
match for ctid against '(123,*)'.

cheers

andrew

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-03-21 16:00:40 Re: Gsoc2012 Idea --- Social Network database schema
Previous Message Andres Freund 2012-03-21 15:58:22 Re: Command Triggers