Re: Gsoc2012 idea, tablesample

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 13:49:30
Message-ID: 4F8D74EA.8060504@enterprisedb.com
Lists: pgsql-hackers
On 17.04.2012 14:55, Qi Huang wrote:
> Hi, Heikki. Thanks for your advice. I will change my plan accordingly. But I have a few questions.
>> 1. We probably don't want the SQL syntax to be added to the grammar.
>> This should be written as an extension, using custom functions as the
>> API, instead of extra SQL syntax.
>
> 1. "This should be written as an extension, using custom functions as the API". Could you explain a bit more what this means?

I mean, it won't be integrated into the PostgreSQL server code. Rather, 
it will be a standalone module that can be distributed as a separate 
.tar.gz file, and installed on a server. PostgreSQL has some facilities 
to help you package code as extensions that can be easily distributed 
and installed.

>> 2. It's not very useful if it's just a dummy replacement for "WHERE
>> random()<  ?". It has to be more advanced than that. Quality of the
>> sample is important, as is performance. There was also an interesting
>> idea of implementing monetary unit sampling.
>
> 2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling.

Yeah, that's one approach. Would be nice to hear more about that, how 
exactly you can use optimizer statistics to help the sampling.

> I may emphasize that point. I will read about monetary unit sampling and add the possibility of implementing this idea to the plan.

Ok, sounds good.
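
To make the difference between the two kinds of sampling concrete, here is a rough Python sketch. It is an illustration only, not a proposed design: the function names, the `amount_of` callback and the sample data are made up for this example. The first function is the analogue of "WHERE random() < ?", the second is the textbook systematic form of monetary unit sampling, where each monetary unit rather than each row is the sampling unit, so rows with large amounts are more likely to be picked.

```python
import random


def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`,
    like filtering with WHERE random() < fraction."""
    return [row for row in rows if rng.random() < fraction]


def monetary_unit_sample(rows, amount_of, n, rng):
    """Systematic monetary unit sampling (assumes non-negative amounts).

    Treat every monetary unit as a sampling unit, pick one unit per
    fixed interval after a random start, and return the rows that
    contain the picked units. A row is returned at most once.
    """
    total = sum(amount_of(row) for row in rows)
    if n <= 0 or total <= 0:
        return []
    interval = total / n
    next_hit = rng.uniform(0, interval)  # random start in the first interval
    picked = []
    cumulative = 0
    for row in rows:
        cumulative += amount_of(row)
        if next_hit < cumulative:
            picked.append(row)
            # Skip any further hits that land inside the same row.
            while next_hit < cumulative:
                next_hit += interval
    return picked


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    # Hypothetical (id, amount) rows; "d" dominates the total amount.
    invoices = [("a", 1), ("b", 1), ("c", 1), ("d", 1000), ("e", 1), ("f", 1)]
    print(bernoulli_sample(invoices, 0.5, rng))
    print(monetary_unit_sample(invoices, lambda r: r[1], 2, rng))
```

With these made-up rows, the row with amount 1000 is wider than the sampling interval, so the monetary unit sample always contains it, while the Bernoulli sample treats it like any other row.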

>> Another idea that Robert Haas suggested was to add support for doing a
>> TID scan for a query like "WHERE ctid < '(501,1)'". That's not enough
>> work for a GSoC project on its own, but could certainly be a part of it.
>
> 3. I read about the replies on using ctid. But I don't quite understand how that might help. ctid is just a physical location of row version within the table. If I do "where ctid<'(501, 1)'", what is actually happening?

At the moment, if you do "WHERE ctid = '(501,1)'", you get an access plan 
with a TidScan, which quickly fetches the row from that exact physical 
location. But if you do "WHERE ctid < '(501,1)'", you get a SeqScan, 
which scans the whole table. That's clearly wasteful: you know the 
physical range of pages you need to scan, everything up to page 501. But 
the SeqScan will scan pages > 501, too. The idea is to improve that so 
that you'd only scan the pages up to page 501.
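
As a toy model of the difference, here is a small Python sketch. It has nothing to do with the actual executor code: the table is just a list of pages, a ctid is modelled as a (block, offset) tuple with offsets starting at 1, and the names `seq_scan` and `tid_range_scan` are made up for this illustration. It only counts how many pages each strategy has to read.

```python
def seq_scan(pages, predicate):
    """Read every page and filter tuples by `predicate`.

    Returns (matching tids, number of pages read).
    """
    matches, pages_read = [], 0
    for block, page in enumerate(pages):
        pages_read += 1
        for offset in range(1, len(page) + 1):
            tid = (block, offset)
            if predicate(tid):
                matches.append(tid)
    return matches, pages_read


def tid_range_scan(pages, upper_tid):
    """Read only the pages that can hold tuples with tid < upper_tid.

    Returns (matching tids, number of pages read).
    """
    upper_block, upper_offset = upper_tid
    # The upper block itself only matters if some offset below
    # upper_offset exists on it (offsets start at 1).
    stop = upper_block if upper_offset <= 1 else upper_block + 1
    stop = max(0, min(stop, len(pages)))
    matches, pages_read = [], 0
    for block in range(stop):
        pages_read += 1
        for offset in range(1, len(pages[block]) + 1):
            tid = (block, offset)
            if tid < upper_tid:  # tuples compare lexicographically
                matches.append(tid)
    return matches, pages_read


if __name__ == "__main__":
    # Hypothetical table: 1000 pages with 3 tuples each.
    table = [["row"] * 3 for _ in range(1000)]
    _, full_reads = seq_scan(table, lambda tid: tid < (501, 1))
    _, range_reads = tid_range_scan(table, (501, 1))
    print(full_reads, range_reads)
```

On this made-up 1000-page table both functions return the same tuples, but the sequential scan reads all 1000 pages while the range-limited scan reads only the 501 pages numbered 0 to 500.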

> Can I add in this as an optional implementation? I think I can check how to do this if I can have enough time in this project.

Yeah, that sounds reasonable.

> Besides, I saw the Gsoc site editing has been closed. Should I just submit through this mailing list with attachment?

Just post the updated details to this mailing list. Preferably inline, 
not as an attachment. You don't need to post the contact details, 
biography, etc., just the updated inch-stones and project details parts.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
