Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-06 12:26:22
Message-ID: 55227B6E.8060407@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06/04/15 12:33, Amit Kapila wrote:
> On Sat, Apr 4, 2015 at 8:25 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com
> <mailto:petr(at)2ndquadrant(dot)com>> wrote:
> >
> > Yes I want extensibility here. And I think the tablesample method
> arguments are same thing as function arguments given that in the end
> they are arguments for the init function of tablesampling method.
> >
> > I would be ok with just expr_list, naming parameters here isn't
> overly important, but ability to have different types and numbers of
> parameters for custom TABLESAMPLE methods *is* important.
> >
>
> Yeah, named parameters is one reason which I think won't
> be required for sample methods and neither the same is
> mentioned in docs (if user has to use, what is the way he
> can pass the same) and another is number of arguments
> for sampling methods which is currently seems to be same
> as FUNC_MAX_ARGS, I think that is sufficient but do we
> want to support that many arguments for sampling method.
>

I think I'll go with simple list of a_exprs. The reason for that is that
I can foresee sampling methods that need multiple parameters, but I
don't think naming them is very important. Also adding support for
naming parameters can be done in the future if we decide so without
breaking compatibility. Side benefit is that it makes hinting about what
is wrong with input somewhat easier.

I don't think we need to come up with different limit from
FUNC_MAX_ARGS. I don't think any sampling method would need that many
parameters but I also don't see what would additional smaller limit give us.

>
> >>
> >> 2.
> >> postgres=# explain update test_tablesample TABLESAMPLE system(30) set id
> >> = 2;
> >> ERROR: syntax error at or near "TABLESAMPLE"
> >> LINE 1: explain update test_tablesample TABLESAMPLE system(30) set i...
> >>
> >> postgres=# Delete from test_tablesample TABLESAMPLE system(30);
> >> ERROR: syntax error at or near "TABLESAMPLE"
> >> LINE 1: Delete from test_tablesample TABLESAMPLE system(30);
> >>
> >> Isn't TABLESAMPLE clause suppose to work with Update/Delete
> >> statements?
> >>
> >
> > It's supported in the FROM part of UPDATE and USING part of DELETE. I
> think that that's sufficient.
> >
>
> But I think the Update on target table with sample scan is
> supported via views which doesn't seem to be the right thing
> in case you just want to support it via FROM/USING, example
>
> postgres=# create view vw_test As select * from test_tablesample
> TABLESAMPLE sys
> tem(30);
> postgres=# explain update vw_test set id = 4;
> QUERY PLAN
> ---------------------------------------------------------------------------
> Update on test_tablesample (cost=0.00..4.04 rows=4 width=210)
> -> Sample Scan on test_tablesample (cost=0.00..4.04 rows=4 width=210)
> (2 rows)
>

Right, I'll make those views not auto-updatable.

>
> > Standard is somewhat useless for UPDATE and DELETE as it only defines
> quite limited syntax there. From what I've seen when doing research
> MSSQL also only supports it in their equivalent of FROM/USING list,
> Oracle does not seem to support their SAMPLING clause outside of SELECTs
> at all and if I got the cryptic DB2 manual correctly I think they don't
> support it outside of (sub)SELECTs either.
> >
>
> By the way, what is the usecase to support sample scan in
> Update or Delete statement?
>

Well for the USING/FROM part the use-case is same as for SELECT -
providing sample of the data for the query (it can be useful also for
getting pseudo random rows fast). And if we didn't support it, it could
still be done using sub-select so why not have it directly.

> Also, isn't it better to mention in the docs for Update and
> Delete incase we are going to support tablesample clause
> for them?
>

Most of other clauses that we support in FROM are not mentioned in
UPDATE/DELETE docs, both of those commands just say something like
"refer to the SELECT FROM docs for more info". Do you think TABLESAMPLE
deserves special treatment in this regard?

> >
> >> 7.
> >> ParseTableSample()
> >> {
> >> ..
> >> arg = coerce_to_specific_type(pstate, arg, INT4OID, "REPEATABLE");
> >> ..
> >> }
> >>
> >> What is the reason for coercing value of REPEATABLE clause to INT4OID
> >> when numeric value is expected for the clause. If user gives the
> >> input value as -2.3, it will generate a seed which doesn't seem to
> >> be right.
> >>
> >
> > Because the REPEATABLE is numeric expression so it can produce
> whatever number but we need int4 internally (well float4 would also work
> just the code would be slightly uglier).
> >
>
> Okay, I understand that part. Here the real point is why not just expose
> it as int4 to user rather than telling in docs that it is numeric and
> actually we neither expect nor use it as numberic.
>
> Even Oracle supports supports it as int with below description.
> The seed_value must be an integer between 0 and 4294967295

Well the thing with SQL Standard's "numeric value expression" is that it
actually does not mean numeric data type, it's just simple arithmetic
expression with some given rules (by the standard), but the output data
type can be either implementation specific approximate number or
implementation specific exact number (depending on inputs by standard's
definition, but meh). We support a_expr instead which gives much more
flexibility on input. For now I changed wording of the docs to say that
input is a number instead of using word numeric there.

>
> > And we do this type of coercion even for table data (you can insert
> -2.3 into integer column and it will work) so I don't see what's wrong
> with it here.
> >
>
> I am not sure we can compare it with column of a table. I think we
> can support it within a valid range (similar to tablesample method) and
> if user inputs value outside the range, then return error.
>

But that's not what standard says, it says any numeric value expression
is valid. The fact that Oracle limits it to some range should not make
us do the same. I think most important thing here is that using -2.3
will produce same results if called repeatedly (if there are no changes
to data, vacuum etc). Yes passing -2 will produce same results, I don't
know if that is a problem. The main reason why I have the coercion there
is so that users don't have to explicitly typecast expression results.

> >>
> >> 8.
> >> +DATA(insert OID = 3295 ( tsm_system_initPGNSP PGUID 12 1 0 0 0 f f f f
> >> t f v 3 0 2278 "2281
> >> 23 700" _null_ _null_ _null_ _null_tsm_system_init _null_ _null_
> _null_ ));
> >> +DATA(insert OID = 3301 ( tsm_bernoulli_initPGNSP PGUID 12 1 0 0 0 f f
> >> f f t f v 3 0 2278 "2281
> >> 23 700" _null_ _null_ _null_ _null_tsm_bernoulli_init _null_ _null_
> >> _null_ ));
> >>
> >> Datatype for second argument is kept as 700 (Float4), shouldn't
> >> it be 1700 (Numeric)?
> >>
> >
> > Why is that?
> >
>
> As we are exposing it as numeric.
>

See my comment for the REPEATABLE. Checking the docs, I actually wrote
there "floating point" so hopefully it's not confusing.

> >Given the sampling error I think the float4 is enough for specifying
> the percentage and it makes the calculations much easier and faster than
> dealing with Numeric would.
> >
>
> Your explanation makes sense to me and we can leave it as it is.

Cool.

>
> One more point:
>
> - [ ONLY ] <replaceable class="parameter">table_name</replaceable> [
> * ] [ [ AS ] <replaceable
> class="parameter">alias</replaceable> [ ( <replaceable
> class="parameter">column_alias</replaceable> [, ...] )
> ] ]
> + [ ONLY ] <replaceable class="parameter">table_name</replaceable> [
> * ] [ TABLESAMPLE <replaceable
> class="parameter">sampling_method</replaceable> ( <replaceable
> class="parameter">argument</replaceable> [,
> ...] ) [ REPEATABLE ( <replaceable class="parameter">seed</replaceable>
> ) ] ] [ [ AS ] <replaceable
> class="parameter">alias</replaceable> [ ( <replaceable
> class="parameter">column_alias</replaceable> [, ...] )
> ] ]
>
> In documentation, AS is still after TABLESAMPLE clause even
> though you have already changed gram.y for the same.
>

Ah right, thanks.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2015-04-06 12:30:24 Re: TABLESAMPLE patch
Previous Message Fujii Masao 2015-04-06 12:10:14 Re: pg_rewind and log messages