Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-26 15:12:35
Message-ID: 20190326151235.GA3088@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 26, 2019 at 11:59:56AM +0000, Dean Rasheed wrote:
>On Mon, 25 Mar 2019 at 23:36, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Attached is an updated patch, fixing all the issues pointed out so far.
>> Unless there are some objections, I plan to commit the 0001 part by the
>> end of this CF. Part 0002 is a matter for PG13, as previously agreed.
>>
>
>Yes, I think that's reasonable. It looks to be in pretty good shape. I
>have reviewed most of the actual code, but note that I haven't
>reviewed the docs changes and I didn't spend much time reading code
>comments. It might benefit from a quick once-over comment tidy up.
>
>I just looked through the latest set of changes and I have a couple of
>additional review comments:
>
>In the comment about WIDTH_THRESHOLD, s/pg_statistic/pg_statistic_ext/.
>
>In statext_mcv_build(), I'm not convinced by the logic around keeping
>the whole MCV list if it fits. Suppose there were a small number of
>very common values, and then a bunch of uniformly distributed less
>common values. The sample might consist of all the common values, plus
>one or two instances of some of the uncommon ones, leading to a list
>that would fit, but it would not be appropriate to keep the uncommon
>values on the basis of having seen them only one or two times. The
>fact that the list of items seen fits doesn't by itself mean that
>they're all common enough to justify being kept. In the per-column
>stats case, there are a bunch of other checks that have to pass, which
>are intended to test not just that the list fits, but that it believes
>that those are all the items in the table. For MV stats, you don't
>have that, and so I think it would be best to just remove that test
>(the "if (ngroups > nitems)" test) and *always* call
>get_mincount_for_mcv_list() to determine how many MCV items to keep.
>Otherwise there is a risk of keeping too many MCV items, with the ones
>at the tail end of the list producing poor estimates.
>

I need to think about it a bit, but I think you're right we may not need
to keep those items. The reason why I decided to keep them is that they
may represent cases where the (actual frequency << base frequency). But
now that I think about it, that can probably be handled as

((1 - total_sel) / ndistinct)

So yeah, I think I'll just change the code to always us the mincount.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2019-03-26 15:12:38 Re: New vacuum option to do only freezing
Previous Message Alexander Korotkov 2019-03-26 15:06:21 Re: jsonpath