Re: TABLESAMPLE patch

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TABLESAMPLE patch
Date: 2014-12-22 12:35:34
Message-ID: 54981016.803@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 22.12.2014 10:07, Petr Jelinek wrote:
> On 21/12/14 18:38, Tomas Vondra wrote:
>>
>> (1) The patch adds a new catalog, but does not bump CATVERSION.
>>
>
> I thought this was always done by committer?

Right. Sorry for the noise.

>
>> (2) The catalog naming (pg_tablesamplemethod) seems a bit awkward,
>> as it squishes everything into a single chunk. That's inconsistent
>> with naming of the other catalogs. I think pg_table_sample_method
>> would be better.
>
> Fair point, but perhaps pg_tablesample_method then as tablesample is
> used as single word everywhere including the standard.

Sounds good.

>> (4) I noticed there's an interesting extension in SQL Server, which
>> allows specifying PERCENT or ROWS, so you can say
>>
>> SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT);
>>
>> or
>>
>> SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS);
>>
>> That seems handy, and it'd make migration from SQL Server easier.
>> What do you think?
>
> Well doing it exactly this way somewhat kills the extensibility which
> was one of the main goals for me - I allow any kind of parameters for
> sampling and the handling of those depends solely on the sampling
> method. This means that in my approach if you'd want to change what you
> are limiting you'd have to write new sampling method and the query would
> then look like SELECT * FROM table TABLESAMPLE SYSTEM_ROWLIMIT (2500);
> or some such (depending on how you name the sampling method). Or SELECT
> * FROM table TABLESAMPLE SYSTEM (2500, 'ROWS'); if we chose to extend
> the SYSTEM sampling method, that would be also doable.
>
> The reason for this is that I don't want to really limit too much what
> parameters can be send to sampling (I envision even SYSTEM_TIMED
> sampling method that will get limit as time interval for example).

Good point.

>>
>> (6) This seems slightly wrong, because of long/uint32 mismatch:
>>
>> long seed = PG_GETARG_UINT32(1);
>>
>> I think uint32 would be more appropriate, no?
>
> Probably, although I need long later in the algorithm anyway.

Really? ISTM the sampler_setseed() does not really require long, uint32
would work exactly the same.

>>
>> (9) The current regression tests only use the REPEATABLE cases. I
>> understand queries without this clause are RANDOM, but maybe we
>> could do something like this:
>>
>> SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM (
>> SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50)
>> ) foo;
>>
>> Granted, there's still a small probability of false positive, but
>> maybe that's sufficiently small? Or is the amount of code this
>> tests negligible?
>
> Well, depending on fillfactor and limit it could be made quite reliable
> I think, I also want to add test with VIEW (I think v2 has a bug there)
> and perhaps some JOIN.

OK.

>>
>> (10) In the initial patch you mentioned it's possible to write custom
>> sampling methods. Do you think a CREATE TABLESAMPLE METHOD,
>> allowing custom methods implemented as extensions would be useful?
>>
>
> Yes, I think so, CREATE/DROP TABLESAMPLE METHOD is on my TODO, but since
> that's just simple mechanical work with no hard problems to solve there
> I can add it later once we have agreement on the general approach of the
> patch (especially in the terms of extensibility).

OK, good to know.

Tomas

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2014-12-22 12:47:11 Re: pgbench -f and vacuum
Previous Message Etsuro Fujita 2014-12-22 11:20:25 Re: inherit support for foreign tables