Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-04 14:55:01
Message-ID: 551FFB45.6060707@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/04/15 14:57, Amit Kapila wrote:
>
> 1.
> +tablesample_clause:
> +TABLESAMPLE ColId '(' func_arg_list ')' opt_repeatable_clause
>
>
> Why do you want to allow func_arg_list?
>
> Basically if user tries to pass multiple arguments in
> TABLESAMPLE method's clause like (10,20), then I think
> that should be caught in grammer level as an error.

It will be reported as error during parse analysis if the TABLESAMPLE
method does not accept two parameters, same as when the expression used
wrong type for example.

>
> SQL - 2003 specs says that argument to REPAEATABLE and TABLESAMPLE
> method is same <numeric value expr>
>
> It seems to me that you want to allow it to make it extendable
> to user defined Tablesample methods, but not sure if it is
> right to use func_arg_list for the same because sample method
> arguments can have different definition than function arguments.
> Now even if we want to keep sample method arguments same as
> function arguments that sounds like a separate discussion.
>

Yes I want extensibility here. And I think the tablesample method
arguments are same thing as function arguments given that in the end
they are arguments for the init function of tablesampling method.

I would be ok with just expr_list, naming parameters here isn't overly
important, but ability to have different types and numbers of parameters
for custom TABLESAMPLE methods *is* important.

> In general, I feel you already have good basic infrastructure for
> supportting other sample methods, but it is better to keep the new
> DDL's for doing the same as a separate patch than this patch, as that
> way we can reduce the scope of this patch, OTOH if you or others
> feel that it is mandatory to have new DDL's support for other
> tablesample methods then we have to deal with this now itself.

Well I did attach it as separate diff mainly for that reason. I agree
that DDL does not have to be committed immediately with the rest of the
patch (although it's the simplest part of the patch IMHO).

>
> 2.
> postgres=# explain update test_tablesample TABLESAMPLE system(30) set id
> = 2;
> ERROR: syntax error at or near "TABLESAMPLE"
> LINE 1: explain update test_tablesample TABLESAMPLE system(30) set i...
>
> postgres=# Delete from test_tablesample TABLESAMPLE system(30);
> ERROR: syntax error at or near "TABLESAMPLE"
> LINE 1: Delete from test_tablesample TABLESAMPLE system(30);
>
> Isn't TABLESAMPLE clause suppose to work with Update/Delete
> statements?
>

It's supported in the FROM part of UPDATE and USING part of DELETE. I
think that that's sufficient.

Standard is somewhat useless for UPDATE and DELETE as it only defines
quite limited syntax there. From what I've seen when doing research
MSSQL also only supports it in their equivalent of FROM/USING list,
Oracle does not seem to support their SAMPLING clause outside of SELECTs
at all and if I got the cryptic DB2 manual correctly I think they don't
support it outside of (sub)SELECTs either.

> 4.
> SampleTupleVisible()
> {
> ..
> else
> {
> /* No pagemode, we have to check the tuple itself. */
> Snapshot
> snapshot = scan->rs_snapshot;
> Bufferbuffer = scan->rs_cbuf;
>
> bool visible =
> HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);
> ..
> }
>
> I think it is better to check if PageIsAllVisible() in above
> code before visibility check as that can avoid visibility checks.

It's probably even better to do that one level up in the samplenexttup()
and only call the SampleTupleVisible if page is not allvisible
(PageIsAllVisible() is cheap).

>
> 6.
> extern TableSampleClause *
> ParseTableSample(ParseState *pstate, char *samplemethod, Node *repeatable,
> List *sampleargs)
>
> It is better to expose function (usage of extern) via header file.
> Is there a need to mention extern here?
>

Eh, stupid copy-paste error when copying function name from header to
actual source file.

> 7.
> ParseTableSample()
> {
> ..
> arg = coerce_to_specific_type(pstate, arg, INT4OID, "REPEATABLE");
> ..
> }
>
> What is the reason for coercing value of REPEATABLE clause to INT4OID
> when numeric value is expected for the clause. If user gives the
> input value as -2.3, it will generate a seed which doesn't seem to
> be right.
>

Because the REPEATABLE is numeric expression so it can produce whatever
number but we need int4 internally (well float4 would also work just the
code would be slightly uglier). And we do this type of coercion even for
table data (you can insert -2.3 into integer column and it will work) so
I don't see what's wrong with it here.

>
> 8.
> +DATA(insert OID = 3295 ( tsm_system_initPGNSP PGUID 12 1 0 0 0 f f f f
> t f v 3 0 2278 "2281
> 23 700" _null_ _null_ _null_ _null_tsm_system_init _null_ _null_ _null_ ));
> +DATA(insert OID = 3301 ( tsm_bernoulli_initPGNSP PGUID 12 1 0 0 0 f f
> f f t f v 3 0 2278 "2281
> 23 700" _null_ _null_ _null_ _null_tsm_bernoulli_init _null_ _null_
> _null_ ));
>
> Datatype for second argument is kept as 700 (Float4), shouldn't
> it be 1700 (Numeric)?
>

Why is that? Given the sampling error I think the float4 is enough for
specifying the percentage and it makes the calculations much easier and
faster than dealing with Numeric would.

> 9.
> postgres=# explain SELECT t1.id <http://t1.id> FROM test_tablesample as
> t1 TABLESAMPLE SYSTEM (
> 10), test_tablesample as t2 TABLESAMPLE BERNOULLI (20);
> QUERY PLAN
> ----------------------------------------------------------------------------
> Nested Loop (cost=0.00..4.05 rows=100 width=4)
> -> Sample Scan on test_tablesample t1 (cost=0.00..0.01 rows=1 width=4)
> -> Sample Scan on test_tablesample t2 (cost=0.00..4.02 rows=2 width=0)
> (3 rows)
>
> Isn't it better to display sample method name in explain.
> I think it can help in case of join queries.
> We can use below format to display:
> Sample Scan (System) on test_tablesample ...

Good idea.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-04-04 14:57:41 Re: Compile warnings on OSX 10.10 clang 6.0
Previous Message Andrew Gierth 2015-04-04 14:45:44 Re: Abbreviated keys for Numeric