Re: More stable query plans via more predictable column statistics

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: More stable query plans via more predictable column statistics
Date: 2016-04-02 18:57:53
Message-ID: CACACo5Qmkt1fdQrGJ_mcLJhJmJ2YpXX-7iB4qen+4sni=YPeRA@mail.gmail.com
Lists: pgsql-hackers

On Apr 2, 2016 18:38, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
> > On Apr 1, 2016 23:14, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Haven't looked at 0002 yet.
>
> > [crosses fingers] hope you'll have a chance to do that before feature
> > freeze for 9.6
>
> I studied this patch for awhile after rebasing it onto yesterday's
> commits.

Fantastic! I could not hope for a better reply :-)

> I did not like the fact that the compute_scalar_stats logic
> would allow absolutely anything into the MCV list once num_hist falls
> below 2. I think it's important that we continue to reject values that
> are only seen once in the sample, because there's no very good reason to
> think that they are MCVs and not just infrequent values that by luck
> appeared in the sample.

In my understanding, we only put a value in the track list if we've seen it
at least twice, no?
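
At least that's how I read the scan of the sorted sample.  Roughly (just an
illustration of my reading, not the actual analyze.c code):

#include <stdio.h>

int
main(void)
{
    int sample[] = {1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 5, 6};   /* sorted sample */
    int n = sizeof(sample) / sizeof(sample[0]);
    int i = 0;

    while (i < n)
    {
        int j = i + 1;

        /* advance j past the current group of equal values */
        while (j < n && sample[j] == sample[i])
            j++;

        /* only values seen at least twice become track-list candidates */
        if (j - i >= 2)
            printf("candidate value %d, count %d\n", sample[i], j - i);

        i = j;
    }
    return 0;
}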

> However, after I rearranged the tests there so
> that "if (num_hist >= 2)" only controlled whether to apply the 1/K limit,
> one of the regression tests started to fail:

Uh-oh.

> there's a place in
> rowsecurity.sql that expects that if a column contains nothing but several
> instances of a single value, that value will be recorded as a lone MCV.
> Now this isn't a particularly essential thing for that test, but it still
> seems like a good property for ANALYZE to have.

No objection here.

> The reason it's failing,
> of course, is that the test as written cannot possibly accept the last
> (or only) value.

Yeah, this I would expect from such a change.

> Before I noticed the regression failure, I'd been thinking that maybe it'd
> be better if the decision rule were not "at least 100+x% of the average
> frequency of this value and later ones", but "at least 100+x% of the
> average frequency of values after this one".

Hm, sounds pretty similar to what I wanted to achieve, but better
formalized.
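
Just to be sure I follow: the only difference is which values the average is
taken over.  A toy comparison, purely for illustration (the counts and x
below are made up, and this is not the patch code):

#include <stdio.h>

int
main(void)
{
    int counts[] = {40, 10, 6, 3, 2};   /* per-value counts, descending */
    int n = sizeof(counts) / sizeof(counts[0]);
    double x = 0.5;                     /* the "extra" fraction */

    for (int i = 0; i < n; i++)
    {
        int sum_after = 0;

        for (int j = i + 1; j < n; j++)
            sum_after += counts[j];

        /* old rule: average over this value and the later ones */
        double old_thresh = (1.0 + x) * (counts[i] + sum_after) / (n - i);

        if (i + 1 < n)
        {
            /* new rule: average over the values strictly after this one */
            double new_thresh = (1.0 + x) * sum_after / (n - i - 1);

            printf("count %2d: old threshold %.2f, new threshold %.2f\n",
                   counts[i], old_thresh, new_thresh);
        }
        else
            printf("count %2d: old threshold %.2f, new rule needs the special case\n",
                   counts[i], old_thresh);
    }
    return 0;
}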

> With that formulation, we're
> not constrained as to the range of x. Now, if there are *no* values after
> this one, then this way needs an explicit special case in order not to
> compute 0/0; but the preceding para shows that we need a special case for
> the last value anyway.
>
> So, attached is a patch rewritten along these lines. I used 50% rather
> than 25% as the new cutoff percentage --- obviously it should be higher
> in this formulation than before, but I have no idea if that particular
> number is good or we should use something else. Also, the rule for the
> last value is "at least 1% of the non-null samples". That's a pure guess
> as well.
>
> I do not have any good corpuses of data to try this on. Can folks who
> have been following this thread try it on their data and see how it
> does? Also please try some other multipliers besides 1.5, so we can
> get a feeling for where that cutoff should be placed.

Expect me to run it on my pet db early next week. :-)
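
Meanwhile, to make sure I end up testing the right thing, this is how I read
the proposed rule (a standalone toy, not the patch; counts[] and nonnull_cnt
are made up, and mult is the knob to play with):

#include <stdio.h>

int
main(void)
{
    int counts[] = {120, 40, 12, 9, 8, 7};  /* per-value counts, descending */
    int n = sizeof(counts) / sizeof(counts[0]);
    int nonnull_cnt = 3000;                 /* pretend non-null sample size */
    double mult = 1.5;                      /* 100% + 50%, per the patch */

    for (int i = 0; i < n; i++)
    {
        int accept;

        if (i + 1 < n)
        {
            double avg_later = 0.0;

            for (int j = i + 1; j < n; j++)
                avg_later += counts[j];
            avg_later /= (n - i - 1);

            accept = counts[i] >= mult * avg_later;
        }
        else
        {
            /* last value: "at least 1% of the non-null samples" */
            accept = counts[i] >= 0.01 * nonnull_cnt;
        }

        printf("count %3d: %s\n", counts[i], accept ? "MCV" : "rejected");

        /* assuming the list is cut off at the first value that fails */
        if (!accept)
            break;
    }
    return 0;
}

With these made-up counts and mult = 1.5, the toy keeps 120, 40 and 12 and
stops at 9; raising or lowering mult moves that cut-off point, which I guess
is exactly the thing to watch when trying other multipliers.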

Many thanks!
--
Alex
