Re: MCV lists for highly skewed distributions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John Naylor <jcnaylor(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-02-01 17:49:22
Message-ID: CA+TgmoZOHwXjyrw4Fi-Lrdz5ChB6GN77xg4adJQvgg70_9kjAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 1, 2018 at 12:21 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> For a highly skewed distribution, it is possible for there to be
> hardly any values (maybe only one) that appears more than 1.25 times
> the average frequency, and so lots of otherwise perfectly good common
> values get discarded. For such a distribution, I don't think that the
> average frequency is particularly interesting, and it shouldn't be
> used to filter the MCV list.

Although I don't think I have enough math background to analyze this
as rigorously as you, I agree with you on this point: the average
frequency doesn't seem very interesting.

One point which I want to emphasize is that the length of the MCV list
bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
ever thought to be more frequent than the least-common MCVs, and
however many non-MCVs we think we have (probably fewer than we
actually have) have to fit into whatever percentage of the table is
consumed by MCVs. This would be less important if we had reliable
n_distinct estimates, but we don't. So, even throwing things into the
MCV list that are no more common than the average item can improve
planning in some cases.

> Testing it with the example from [1], it does indeed appear to solve
> the too-many-MCVs problem in that particular case (actually generating
> no MCVs).
>
> Testing it with the highly skewed example at the start of this thread,
> it typically generates a couple more MCVs, producing somewhat better
> estimates, but to get really good estimates for who=17, you need to
> crank up the stats target. It does at least appear to no longer be the
> case that cranking up the stats target has a weak effect.

Hmm, nice result. I agree with you that it would be nice if we can
come up with a good general-purpose algorithm for this, rather than
making it pluggable. I don't know whether we can succeed in doing
that but we should try hard, because it's always better when things
just work automatically...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-02-01 18:45:47 Re: [HACKERS] why not parallel seq scan for slow functions
Previous Message Jeff Davis 2018-02-01 17:32:17 Re: JIT compiling with LLVM v9.0