Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Zhenghua Lyu <zlyu(at)vmware(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Should the function get_variable_numdistinct consider the case when stanullfrac is 1.0?
Date: 2020-10-31 00:50:49
Message-ID: 148741.1604105449@sss.pgh.pa.us
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
> So I'm not sure I understand what would be the risk with this ... Tom,
> can you elaborate why you dislike the patch?

I've got a couple issues with the patch as presented.

* As you said, it creates discontinuous behavior for stanullfrac = 1.0
versus stanullfrac = 1.0 - epsilon. That doesn't seem good.

* It's not apparent why, if ANALYZE's sample is all nulls, we wouldn't
conclude stadistinct = 0 and thus arrive at the desired answer that
way. (Since we have a complaint, I'm guessing that ANALYZE might
disbelieve its own result and stick in some larger stadistinct. But
then maybe that's where to fix this, not here.)

* We generally disbelieve edge-case estimates to begin with. The
most obvious example is that we don't accept rowcount estimates that
are zero. There are also some clamps that disbelieve selectivities
approaching 0.0 or 1.0 when estimating from a histogram, and I think
we have a couple other similar rules. The reason for this is mainly
that taking such estimates at face value creates too much risk of
severe relative error due to imprecise or out-of-date statistics.
So a special case for stanullfrac = 1.0 seems to go directly against
that mindset.

I agree that there might be some gold to be mined in this area,
as we haven't thought particularly hard about high-stanullfrac
situations. One idea is to figure what stanullfrac says about the
number of non-null rows, and clamp the get_variable_numdistinct
result to be not more than that. But I still would not want to
trust an exact zero result.
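
A rough sketch of that idea, under stated assumptions: the function name and signature are hypothetical, stadistinct is taken as an already-resolved positive count (the negative "fraction of rows" encoding is not modeled), and the floor of 1.0 is just one way to avoid trusting an exact zero.

```c
/*
 * Hypothetical sketch: cap an ndistinct estimate by the number of
 * non-null rows implied by stanullfrac, but never go below 1.0.
 *
 * ndistinct    - current distinct-value estimate (assumed positive)
 * ntuples      - estimated total rows in the relation
 * stanullfrac  - fraction of rows that are NULL, in [0.0, 1.0]
 */
double
sketch_clamp_ndistinct(double ndistinct, double ntuples, double stanullfrac)
{
    double nonnull_rows = ntuples * (1.0 - stanullfrac);

    /* Assumption: do not believe "zero non-null rows"; floor at one. */
    if (nonnull_rows < 1.0)
        nonnull_rows = 1.0;

    /* There cannot be more distinct non-null values than non-null rows. */
    if (ndistinct > nonnull_rows)
        ndistinct = nonnull_rows;

    /* Keep the result itself from dropping below one. */
    if (ndistinct < 1.0)
        ndistinct = 1.0;

    return ndistinct;
}
```

With 1000 rows and an estimate of 200 distinct values, stanullfrac = 0.875 caps the result at 125, and stanullfrac = 1.0 gives 1.0 rather than zero. Because the cap shrinks smoothly as stanullfrac approaches 1.0, this form would also avoid the discontinuity mentioned in the first point.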

regards, tom lane
