Re: TABLESAMPLE patch

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-01-09 08:27:49
Message-ID: CAB7nPqQtZWe+d+VOmggFik0j9zh50=h4kYmOHGkKebx28JsU_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 9, 2015 at 1:10 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> On 06/01/15 14:22, Petr Jelinek wrote:
>>
>> On 06/01/15 08:51, Michael Paquier wrote:
>>>
>>> On Tue, Dec 23, 2014 at 5:21 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
>>> wrote:
>>>>
>>>> Attached is v3 which besides the fixes mentioned above also includes
>>>> changes
>>>> discussed with Tomas (except the CREATE/DROP TABLESAMPLE METHOD),
>>>> fixes for
>>>> crash with FETCH FIRST and is rebased against current master.
>>>
>>> This patch needs a rebase, there is a small conflict in
>>> parallel_schedule.
>>>
>>
>> Sigh, I really wish we had automation that checks this automatically for
>> patches in CF.
>>
>
> Here is rebase against current master.
Thanks!

>
>>> Structurally speaking, I think that the tsm methods should be added in
>>> src/backend/utils and not src/backend/access which is more high-level
>>> as tsm_bernoulli.c and tsm_system.c contain only a set of new
>>
>>
>> I am not sure if I parsed this correctly, do you mean to say that only
>> low-level access functions belong to src/backend/access? Makes sense.
>
>
> Made this change.
>
>>
>>> procedure functions. Having a single header file tsm.h would be also a
>>> good thing.
>>
>>
>
> Moved into single tablesample.h file.
>
>>
>>> Regarding the naming, is "tsm" (table sample method) really appealing?
>>> Wouldn't it be better to use simply tablesample_* for the file names
>>> and the method names?
>>>
>
> I created the src/backend/tablesample and files are named just system.c and
> bernoulli.c, but I kept tsm_ prefix for methods as they become too long for
> my taste when prefixing with tablesample_.
>
>>> This is a large patch... Wouldn't sampling.[c|h] extracted from
>>> ANALYZE live better as a refactoring patch? This would limit a bit bug
>>> occurrences on the main patch.
>>>
>>
>> That's a good idea, I'll split it into patch series.
>>
>
> I've split the sampling.c/h into separate patch.
>
> I also wrote basic CREATE/DROP TABLESAMPLE METHOD support, again as separate
> patch in the attached patch-set. This also includes modules test with simple
> custom tablesample method.
>
> There are some very minor cleanups in the main tablesample code itself but
> no functional changes.

Some comments about the 1st patch:
1) Nitpicking:
+ * Block sampling routines shared by ANALYZE and TABLESAMPLE.
TABLESAMPLE is not added yet. You may as well mention simplify this
description with something like "Sampling routines for relation
blocks".

2) file_fdw and postgres_fdw do not compile correctly as they still
use anl_random_fract. This function is still mentioned in vacuum.h as
well.

3) Not really an issue of this patch, but I'd think that this comment
should be reworked:
+ * BlockSampler is used for stage one of our new two-stage tuple
+ * sampling mechanism as discussed on pgsql-hackers 2004-04-02 (subject
+ * "Large DB"). It selects a random sample of samplesize blocks out of
+ * the nblocks blocks in the table. If the table has less than
+ * samplesize blocks, all blocks are selected.

4) As a refactoring patch, why is the function providing a random
value changed? Shouldn't sample_random_fract be consistent with
anl_random_fract?

5) The seed numbers can be changed to RAND48_SEED_0, RAND48_SEED_1 and
RAND48_SEED_2 instead of being hardcoded?
Regards,
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2015-01-09 08:30:03 Re: Async execution of postgres_fdw.
Previous Message Dean Rasheed 2015-01-09 08:19:10 Re: INSERT ... ON CONFLICT UPDATE and RLS