Re: More stable query plans via more predictable column statistics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-04-02 16:37:53
Message-ID: 16161.1459615073@sss.pgh.pa.us
Lists: pgsql-hackers

"Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
> On Apr 1, 2016 23:14, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Haven't looked at 0002 yet.

> [crosses fingers] hope you'll have a chance to do that before feature
> freeze for 9.6

I studied this patch for a while after rebasing it onto yesterday's
commits. I did not like the fact that the compute_scalar_stats logic
would allow absolutely anything into the MCV list once num_hist falls
below 2. I think it's important that we continue to reject values that
are only seen once in the sample, because there's no very good reason to
think that they are MCVs and not just infrequent values that by luck
appeared in the sample. However, after I rearranged the tests there so
that "if (num_hist >= 2)" only controlled whether to apply the 1/K limit,
one of the regression tests started to fail: there's a place in
rowsecurity.sql that expects that if a column contains nothing but several
instances of a single value, that value will be recorded as a lone MCV.
Now this isn't a particularly essential thing for that test, but it still
seems like a good property for ANALYZE to have. The reason it's failing,
of course, is that the test as written cannot possibly accept the last
(or only) value.
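(In the lone-value case that's easy to see: "this value and later
ones" is then just the value itself, so its frequency is exactly 100%
of that average, and a threshold of 125% of the average can never be
met.)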

Before I noticed the regression failure, I'd been thinking that maybe it'd
be better if the decision rule were not "at least 100+x% of the average
frequency of this value and later ones", but "at least 100+x% of the
average frequency of values after this one". With that formulation, we're
not constrained as to the range of x. Now, if there are *no* values after
this one, then this way needs an explicit special case in order not to
compute 0/0; but the preceding para shows that we need a special case for
the last value anyway.
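Spelled out: if c_i is the sample count of the i'th value in
decreasing order and n is the number of tracked values, the proposed
test accepts value i when

    c_i >= (1 + x) * (c_(i+1) + ... + c_n) / (n - i)

which is well-defined for every value except the last one.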

So, attached is a patch rewritten along these lines. I used 50% rather
than 25% as the new cutoff percentage --- obviously it should be higher
in this formulation than before, but I have no idea if that particular
number is good or we should use something else. Also, the rule for the
last value is "at least 1% of the non-null samples". That's a pure guess
as well.
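
To make the shape of the rule concrete, here's a simplified
stand-alone sketch --- not the patch itself; the real logic is in
compute_scalar_stats in src/backend/commands/analyze.c with far more
bookkeeping, and this sketch reads "values after this one" as just
the remaining tracked values:

#include <stdio.h>

/*
 * counts[] holds per-value sample counts in decreasing order,
 * ntracked is how many values were tracked, and nonnull_cnt is the
 * number of non-null rows in the sample.
 */
static int
choose_num_mcvs(const int *counts, int ntracked, int nonnull_cnt)
{
    int     num_mcv = 0;
    int     i;

    for (i = 0; i < ntracked; i++)
    {
        if (counts[i] < 2)
            break;              /* never accept a value seen only once */

        if (i == ntracked - 1)
        {
            /* last value: need at least 1% of the non-null sample */
            if (counts[i] * 100 < nonnull_cnt)
                break;
        }
        else
        {
            /* need >= 1.5x the average count of the later values */
            int     rest = 0;
            double  avgcount;
            int     j;

            for (j = i + 1; j < ntracked; j++)
                rest += counts[j];
            avgcount = (double) rest / (double) (ntracked - i - 1);
            if ((double) counts[i] < 1.5 * avgcount)
                break;
        }
        num_mcv++;
    }
    return num_mcv;
}

int
main(void)
{
    /* sorted by descending sample count */
    int     counts[] = {40, 20, 3, 2, 1};

    /* with 100 non-null sample rows this keeps the first four values */
    printf("MCVs kept: %d\n", choose_num_mcvs(counts, 5, 100));
    return 0;
}

The 1.5 multiplier and the 1% floor here are just the guesses
described above, so trying other values means changing only those two
constants.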

I do not have any good corpora of data to try this on. Can folks who
have been following this thread try it on their data and see how it
does? Also please try some other multipliers besides 1.5, so we can
get a feeling for where that cutoff should be placed.

regards, tom lane

Attachment Content-Type Size
better-MCV-cutoff-1.patch text/x-diff 10.3 KB
