Re: TABLESAMPLE patch is really in pretty sad shape

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TABLESAMPLE patch is really in pretty sad shape
Date: 2015-07-16 11:55:47
Message-ID: 55A79BC3.30301@2ndquadrant.com
Lists: pgsql-hackers

On 2015-07-13 15:39, Tom Lane wrote:
>
> Datum
> tsm_system_rows_init(PG_FUNCTION_ARGS)
> {
>     TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
>     uint32      seed = PG_GETARG_UINT32(1);
>     int32       ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);
>
> This is rather curious coding. Why should this function check only
> one of its arguments for nullness? If it needs to defend against
> any of them being null, it really needs to check all. But in point of
> fact, this function is declared STRICT, which means it's a violation of
> the function call protocol if the core code ever passes a null to it,
> and so this test ought to be dead code.
>
> A similar pattern of ARGISNULL checks in declared-strict functions exists
> in all the tablesample modules, not just this one, showing that this is an
> overall design error not just a thinko here. My inclination would be to
> make the core code enforce non-nullness of all tablesample arguments so as
> to make it unnecessary to check strictness of the tsm*init functions, but
> in any case it is Not Okay to just pass nulls willy-nilly to strict C
> functions.
>

The reason for this ugliness is the need to balance modularity against
following the SQL standard error codes for BERNOULLI and SYSTEM, which
came up as an issue during review (the original code did the checks
before calling the sampling method's functions, but produced only a
generic error code about a wrong parameter). I considered it okay-ish
mainly because those functions are not SQL-callable and the code that
calls them knows how to handle it correctly, but I understand why you
don't.

I guess if we did what you proposed in another email in this thread -
not exposing the API at the SQL level - we could pass the additional
parameters as a List * and leave the validation completely to the
function. (And maybe not allow NULLs at all.)
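
Just to illustrate the idea (everything here - the function name, the
private state struct, and the List-of-Const convention - is only an
assumption of mine, not what the patch currently does):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/* hypothetical private state for the SYSTEM_ROWS method */
typedef struct SystemRowsState
{
    uint32      seed;
    int32       ntuples;
} SystemRowsState;

/*
 * Hypothetical init hook: the core code hands over the TABLESAMPLE
 * arguments as a List of already-evaluated Const nodes, and the sampling
 * method does all of its own validation, including the NULL check.
 */
static void
system_rows_init(SystemRowsState *state, List *args, uint32 seed)
{
    Const      *arg = (Const *) linitial(args);

    if (arg->constisnull)
        ereport(ERROR,
                (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                 errmsg("sample size cannot be NULL")));

    state->seed = seed;
    state->ntuples = DatumGetInt32(arg->constvalue);

    if (state->ntuples < 0)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("sample size must not be negative")));
}

That would also keep the strictness question out of the picture
entirely, since the core code would never pass a NULL Datum to a
strict C function.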

> Also, I find this coding pretty sloppy even without the strictness angle,
> because the net effect of this way of dealing with nulls is that an
> argument-must-not-be-null complaint is reported as "out of range",
> which is both confusing and the wrong ERRCODE.
>
>     if (ntuples < 1)
>         ereport(ERROR,
>                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                  errmsg("invalid sample size"),
>                  errhint("Sample size must be positive integer value.")));
>
> I don't find this to be good error message style. The secondary comment
> is not a "hint", it's an ironclad statement of what you did wrong, so if
> we wanted to phrase it like this it should be an errdetail not errhint.
> But the whole thing is overly cute anyway because there is no reason at
> all not to just say what we mean in the primary error message, eg
>     ereport(ERROR,
>             (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>              errmsg("sample size must be greater than zero")));
>

Same as above, although now that I re-read the standard I am sure I
misunderstood it the first time - it says:
"If S is the null value or if S < 0 (zero) or if S > 100, then an
exception condition is raised: data exception — invalid sample size."

I originally took that as a literal error message, but they just mean
the error code by it.

> While we're on the subject, what's the reason for disallowing a sample
> size of zero? Seems like a reasonable edge case.

Yes, that's a bug.
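
For the record, with a dedicated errcode for the standard's condition
the check could end up roughly like this (the errcode name is only my
guess at what would go into errcodes.txt, and the test now also allows
a sample size of zero):

    /*
     * Sketch only: assumes an errcode for SQLSTATE 2202H, the standard's
     * "data exception - invalid sample size"; the macro name is a guess.
     */
    if (ntuples < 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT),
                 errmsg("sample size must not be negative")));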

>
> /* All blocks have been read, we're done */
> if (sampler->doneblocks > sampler->nblocks ||
>     sampler->donetuples >= sampler->ntuples)
>     PG_RETURN_UINT32(InvalidBlockNumber);
>
> Okay, I lied, I *am* going to complain about this comment. Comments that
> do not accurately describe the code they're attached to are worse than
> useless.
>

That's a copy-pasto from tsm_system_time.
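
A comment along these lines would at least describe what the check
actually does (the wording is just a sketch):

    /*
     * Stop once we have either wrapped around past every block or already
     * returned the requested number of tuples, whichever comes first.
     */
    if (sampler->doneblocks > sampler->nblocks ||
        sampler->donetuples >= sampler->ntuples)
        PG_RETURN_UINT32(InvalidBlockNumber);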

> In short, I'd do something more like
>
>     uint32      r;
>
>     /* Safety check to avoid infinite loop or zero result for small n. */
>     if (n <= 1)
>         return 1;
>
>     /*
>      * This should only take 2 or 3 iterations as the probability of 2 numbers
>      * being relatively prime is ~61%; but just in case, we'll include a
>      * CHECK_FOR_INTERRUPTS in the loop.
>      */
>     do {
>         CHECK_FOR_INTERRUPTS();
>         r = (uint32) (sampler_random_fract(randstate) * n);
>     } while (r == 0 || gcd(r, n) > 1);
>
> Note however that this coding would result in an unpredictable number
> of iterations of the RNG, which might not be such a good thing if we're
> trying to achieve repeatability. It doesn't matter in the context of
> this module since the RNG is not used after initialization, but it would
> matter if we then went on to do Bernoulli-style sampling. (Possibly that
> could be dodged by reinitializing the RNG after the initialization steps.)
>

Bernoulli-style sampling does not need this kind of code, so it's not
really an issue - unless you'd like to combine the linear probing with
Bernoulli, of course, but I don't see any benefit in doing that.
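
For what it's worth, here is a self-contained illustration of the
relatively-prime step selection you sketched above, with plain rand()
standing in for sampler_random_fract() so it compiles on its own (the
helper names are mine):

#include <stdint.h>
#include <stdlib.h>

/* Euclid's algorithm: greatest common divisor of two positive integers. */
static uint32_t
gcd(uint32_t a, uint32_t b)
{
    while (b != 0)
    {
        uint32_t    t = b;

        b = a % b;
        a = t;
    }
    return a;
}

/*
 * Pick a step size in [1, n) that is relatively prime to n, so that
 * repeatedly adding it modulo n visits every block exactly once.
 */
static uint32_t
random_relative_prime(uint32_t n)
{
    uint32_t    r;

    /* Avoid an infinite loop or a zero result for very small n. */
    if (n <= 1)
        return 1;

    /*
     * Expected to finish in a couple of iterations, since the probability
     * that two random numbers are relatively prime is about 61%.
     */
    do
    {
        r = (uint32_t) (((double) rand() / ((double) RAND_MAX + 1.0)) * n);
    } while (r == 0 || gcd(r, n) > 1);

    return r;
}

In the module itself this would of course use sampler_random_fract()
and CHECK_FOR_INTERRUPTS(), as in your version.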

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
