Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-09 18:33:56
Message-ID: CAEZATCWAT3hEvWv2JoMfioroy41eKb7roCaM3D0b-BDAoKTbFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 28 Feb 2019 at 19:56, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Attached is an updated version of this patch series.

Here are some random review comments. I'll add more later, but I'm out
of energy for today.

1). src/test/regress/expected/type_sanity.out has bit-rotted.

2). Duplicate OIDs (3425).

3). It looks a bit odd that clauselist_selectivity() calls
statext_clauselist_selectivity(), which does MCV stats and will do
histograms, but it doesn't do dependencies, so
clauselist_selectivity() has to then separately call
dependencies_clauselist_selectivity(). It would seem neater if
statext_clauselist_selectivity() took care of calling
dependencies_clauselist_selectivity(), since dependencies are just
another kind of extended stats.

4). There are no tests for pg_mcv_list_items(). Given a table with a
small enough amount of data, so that it's all sampled, it ought to be
possible to get predictable MCV stats.

5). It's not obvious what some of the new test cases in the
"stats_ext" tests are intended to show. For example, the first test
creates a table with 5000 rows and a couple of indexes, does a couple
of queries, builds some MCV stats, and then repeats the queries, but
the results seem to be the same with and without the stats.

I wonder if it's possible to write smaller, more targeted tests.
Currently "stats_ext" is by far the slowest test in its group, and I'm
not sure that some of those tests add much. It ought to be possible to
write a function that calls EXPLAIN and returns a query's row
estimate, and then you could write tests to confirm the effect of the
new stats by verifying the row estimates change as expected.

6). This enum isn't needed for MCVs:

/*
* Degree of how much MCV item matches a clause.
* This is then considered when computing the selectivity.
*/
#define STATS_MATCH_NONE 0 /* no match at all */
#define STATS_MATCH_PARTIAL 1 /* partial match */
#define STATS_MATCH_FULL 2 /* full match */

STATS_MATCH_PARTIAL is never used for MCVs, so you may as well just
use booleans instead of this enum. If those are needed for histograms,
they can probably be made local to histogram.c.

7). estimate_num_groups_simple() isn't needed in this patch.

8). In README.mcv,
s/clauselist_mv_selectivity_mcvlist/mcv_clauselist_selectivity/.

9). In the list of supported clause types that follows (e) is the same
a (c), but with a more general description.

10). It looks like most of the subsequent description of the algorithm
is out of date and needs rewriting. All the stuff about full matches
and the use of ndistinct is now obsolete.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2019-03-09 18:41:45 Re: Checksum errors in pg_stat_database
Previous Message Tom Lane 2019-03-09 17:55:54 Re: Should we increase the default vacuum_cost_limit?