Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-26 11:59:56
Message-ID: CAEZATCURhZv0kMsy9PoWvx+4nuvmfGi42f9kte7ZNyWOJRKKjw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 25 Mar 2019 at 23:36, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Attached is an updated patch, fixing all the issues pointed out so far.
> Unless there are some objections, I plan to commit the 0001 part by the
> end of this CF. Part 0002 is a matter for PG13, as previously agreed.
>

Yes, I think that's reasonable. It looks to be in pretty good shape. I
have reviewed most of the actual code, but note that I haven't
reviewed the docs changes and I didn't spend much time reading code
comments. It might benefit from a quick once-over comment tidy up.

I just looked through the latest set of changes and I have a couple of
additional review comments:

In the comment about WIDTH_THRESHOLD, s/pg_statistic/pg_statistic_ext/.

In statext_mcv_build(), I'm not convinced by the logic around keeping
the whole MCV list if it fits. Suppose there were a small number of
very common values, and then a bunch of uniformly distributed less
common values. The sample might consist of all the common values, plus
one or two instances of some of the uncommon ones, leading to a list
that would fit, but it would not be appropriate to keep the uncommon
values on the basis of having seen them only one or two times. The
fact that the list of items seen fits doesn't by itself mean that
they're all common enough to justify being kept. In the per-column
stats case, there are a bunch of other checks that have to pass, which
are intended to test not just that the list fits, but that it believes
that those are all the items in the table. For MV stats, you don't
have that, and so I think it would be best to just remove that test
(the "if (ngroups > nitems)" test) and *always* call
get_mincount_for_mcv_list() to determine how many MCV items to keep.
Otherwise there is a risk of keeping too many MCV items, with the ones
at the tail end of the list producing poor estimates.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Surafel Temesgen 2019-03-26 12:06:52 Re: FETCH FIRST clause PERCENT option
Previous Message Yugo Nagata 2019-03-26 11:45:19 Re: Improvement of installation document