Re: Gsoc2012 idea, tablesample

From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-21 06:28:52
Message-ID: BAY159-W354843E3FF73CC06DD86F0A3230@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi, Heikki
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of on implementing monetary unit sampling.
>
>
> Another idea that Robert Haas suggested was to add support doing a TID
> scan for a query like "WHERE ctid< '(501,1)'". That's not enough work
> for GSoC project on its own, but could certainly be a part of it.
>

Based on the discussion these days and my understanding, I don't see much change to be made in my proposal. For the 3 points you raised, the first one and the last one are still not clear. Especially the first point, I see that some people think making the SQL syntax into grammar is first choice. For the second point, the SQL standard 2003 defines two methods for sampling, SYSTEM and BERNOULLI. I think there might be possible quality refinement for them. For the optimization statistics, I have an idea of using it to assign different sampling percentages to different pages, but I'm not sure about the detail yet, I need to see into and learn the optimization statistics (this part is mentioned by Neil in his idea, so I think there should be way of using it). Also there might be enhance on specific sampling, like monetary unit sampling or the geographic indexes sampling. I can do this part(sampling quality improvement) as research based project. We can still discuss deeper to see whether these can be done and how we can do them.
I post my current amended proposal below. The changes are in red color.
-----------------------------------------------------------------------------------------------------------------------Project Details:

Neil Conway has come up
with an implementation at 2007 while he gave a talk of introducing to hacking
in PostgreSQL. The code was just for demo purpose and was incomplete. It was
not integrated into PostgreSQL patch. The PostgreSQL now is quite different from
2007. To implement this query, I need to understand Neil’s implementation and
use the general idea to implement in the most up-to-date PostgreSQL release. In
the end, I need to test the implementation till it can be released in the next
patch. I will also explore possible ways of further enhancing the sampling quality, like using optimization statistics to produce more accurate sample. There is also suggestions that I can integrate different sampling types into this query, like for accounting data, I can use "monetary unit sampling" to get the result, or implement the geographic indexes sampling. These specific types of sampling might require at least a sequence scan on data, which means slow down the sampling speed, but sampling quality will enhance greatly and be very useful in some specific fields.

List
of features:

1. TABLESAMPLE using
select, delete, and update

2. SYSTEM method

3. REPEATABLE support

4. BERNOULLI method

5. Use optimizer
statistics to produce a more accurate sample

6. non-integer
sample percentage and repeat seed7. sampling quality enhancement

4, 5 and 6 are not
included in Neil’s implementation.

For 5, we can use
optimizer statistics to refine the algorithm for the random number selection of
pages or rows. The sample produced shall be more accurate.

Inch-stones:

1. Conduct
the basic features' implementation, able to query TABLESAMPLE clause using
select, SYSTEM, with different combination of SQL queries.

2.
Implementation of other basic features, REPEATABLE and BERNOULLI.

3. Improvement
implementation. Support for using optimizer statistics to produce more accurate
sample, non-integer sample percentage and repeat seed, and sampling quality improvement.

Project Schedule:

1. From
April 23rd-May 10th: learning and understanding.

2. From
Mid May- Mid June: implement simple TABLESAMPLE clause, with SYSTEM method, and
no REPEATABLE support. And do testing.

3. Mid
June-Mid July: implement other supports, like REPEATABLE clause, and BERNOULLI
method, and do testing. Improvement 5 and 6 are also implemented now.

4. Mid July-
Mid Aug: Explore ways of improving sampling quality should be done at period 2 and 3. This period will be used to implement those ideas. -----------------------------------------------------------------------------------------------------------------------

Best Regards and ThanksHuang Qi VictorComputer Science of National University of Singapore

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Albe Laurenz 2012-04-21 08:33:43 Re: Plan stability versus near-exact ties in cost estimates
Previous Message Tom Lane 2012-04-21 04:55:47 Re: Plan stability versus near-exact ties in cost estimates