Re: TABLESAMPLE patch

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-04-06 10:33:06
Message-ID: CAA4eK1JvNCBg8ObYoa23g+7gzWM5kg56hMhxndONsaaqAas_qw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Apr 4, 2015 at 8:25 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>
> On 04/04/15 14:57, Amit Kapila wrote:
>>
>>
>> 1.
>> +tablesample_clause:
>> +TABLESAMPLE ColId '(' func_arg_list ')' opt_repeatable_clause
>>
>> It seems to me that you want to allow it to make it extendable
>> to user defined Tablesample methods, but not sure if it is
>> right to use func_arg_list for the same because sample method
>> arguments can have different definition than function arguments.
>> Now even if we want to keep sample method arguments same as
>> function arguments that sounds like a separate discussion.
>>
>
> Yes I want extensibility here. And I think the tablesample method
arguments are same thing as function arguments given that in the end they
are arguments for the init function of tablesampling method.
>
> I would be ok with just expr_list, naming parameters here isn't overly
important, but ability to have different types and numbers of parameters
for custom TABLESAMPLE methods *is* important.
>

Yeah, named parameters is one reason which I think won't
be required for sample methods and neither the same is
mentioned in docs (if user has to use, what is the way he
can pass the same) and another is number of arguments
for sampling methods which is currently seems to be same
as FUNC_MAX_ARGS, I think that is sufficient but do we
want to support that many arguments for sampling method.

I have shared my thoughts regarding this point with you, if
you don't agree with the same, then proceed as you think is
the best way.

>>
>> 2.
>> postgres=# explain update test_tablesample TABLESAMPLE system(30) set id
>> = 2;
>> ERROR: syntax error at or near "TABLESAMPLE"
>> LINE 1: explain update test_tablesample TABLESAMPLE system(30) set i...
>>
>> postgres=# Delete from test_tablesample TABLESAMPLE system(30);
>> ERROR: syntax error at or near "TABLESAMPLE"
>> LINE 1: Delete from test_tablesample TABLESAMPLE system(30);
>>
>> Isn't TABLESAMPLE clause suppose to work with Update/Delete
>> statements?
>>
>
> It's supported in the FROM part of UPDATE and USING part of DELETE. I
think that that's sufficient.
>

But I think the Update on target table with sample scan is
supported via views which doesn't seem to be the right thing
in case you just want to support it via FROM/USING, example

postgres=# create view vw_test As select * from test_tablesample
TABLESAMPLE sys
tem(30);
postgres=# explain update vw_test set id = 4;
QUERY PLAN
---------------------------------------------------------------------------
Update on test_tablesample (cost=0.00..4.04 rows=4 width=210)
-> Sample Scan on test_tablesample (cost=0.00..4.04 rows=4 width=210)
(2 rows)

> Standard is somewhat useless for UPDATE and DELETE as it only defines
quite limited syntax there. From what I've seen when doing research MSSQL
also only supports it in their equivalent of FROM/USING list, Oracle does
not seem to support their SAMPLING clause outside of SELECTs at all and if
I got the cryptic DB2 manual correctly I think they don't support it
outside of (sub)SELECTs either.
>

By the way, what is the usecase to support sample scan in
Update or Delete statement?

Also, isn't it better to mention in the docs for Update and
Delete incase we are going to support tablesample clause
for them?

>
>> 7.
>> ParseTableSample()
>> {
>> ..
>> arg = coerce_to_specific_type(pstate, arg, INT4OID, "REPEATABLE");
>> ..
>> }
>>
>> What is the reason for coercing value of REPEATABLE clause to INT4OID
>> when numeric value is expected for the clause. If user gives the
>> input value as -2.3, it will generate a seed which doesn't seem to
>> be right.
>>
>
> Because the REPEATABLE is numeric expression so it can produce whatever
number but we need int4 internally (well float4 would also work just the
code would be slightly uglier).
>

Okay, I understand that part. Here the real point is why not just expose
it as int4 to user rather than telling in docs that it is numeric and
actually we neither expect nor use it as numberic.

Even Oracle supports supports it as int with below description.
The seed_value must be an integer between 0 and 4294967295

> And we do this type of coercion even for table data (you can insert -2.3
into integer column and it will work) so I don't see what's wrong with it
here.
>

I am not sure we can compare it with column of a table. I think we
can support it within a valid range (similar to tablesample method) and
if user inputs value outside the range, then return error.

>>
>> 8.
>> +DATA(insert OID = 3295 ( tsm_system_initPGNSP PGUID 12 1 0 0 0 f f f f
>> t f v 3 0 2278 "2281
>> 23 700" _null_ _null_ _null_ _null_tsm_system_init _null_ _null_ _null_
));
>> +DATA(insert OID = 3301 ( tsm_bernoulli_initPGNSP PGUID 12 1 0 0 0 f f
>> f f t f v 3 0 2278 "2281
>> 23 700" _null_ _null_ _null_ _null_tsm_bernoulli_init _null_ _null_
>> _null_ ));
>>
>> Datatype for second argument is kept as 700 (Float4), shouldn't
>> it be 1700 (Numeric)?
>>
>
> Why is that?
>

As we are exposing it as numeric.

>Given the sampling error I think the float4 is enough for specifying the
percentage and it makes the calculations much easier and faster than
dealing with Numeric would.
>

Your explanation makes sense to me and we can leave it as it is.

One more point:

- [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ]
[ [ AS ] <replaceable
class="parameter">alias</replaceable> [ ( <replaceable
class="parameter">column_alias</replaceable> [, ...] )
] ]
+ [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ * ]
[ TABLESAMPLE <replaceable
class="parameter">sampling_method</replaceable> ( <replaceable
class="parameter">argument</replaceable> [,
...] ) [ REPEATABLE ( <replaceable class="parameter">seed</replaceable> ) ]
] [ [ AS ] <replaceable
class="parameter">alias</replaceable> [ ( <replaceable
class="parameter">column_alias</replaceable> [, ...] )
] ]

In documentation, AS is still after TABLESAMPLE clause even
though you have already changed gram.y for the same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2015-04-06 12:10:14 Re: pg_rewind and log messages
Previous Message Simon Riggs 2015-04-06 09:02:52 Re: TABLESAMPLE patch