Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-17 12:04:14
Message-ID: CAKJS1f81TbFHJtfzYYVH19NWroxX4NOyZ2vvd5y+jb6VLr3WRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've started looking over 0002. Here are a few things so far:

1. I think this should be pg_statistic_ext.stxhistogram?

Values of the <type>pg_histogram</type> can be obtained only from the
<literal>pg_statistic.stxhistogram</literal> column.

2. I don't think this bms_copy is needed anymore. I think it was
previously since there were possibly multiple StatisticExtInfo objects
per pg_statistic_ext row, but now it's 1 for 1.

+ info->keys = bms_copy(keys);

naturally, the bms_free() will need to go too.

3. I've not really got into understanding how the new statistics types
are applied yet, but I found this:

* If asked to build both MCV and histogram, first build the MCV part
* and then histogram on the remaining rows.

I guess that means we'll get different estimates with:

create statistic a_stats (mcv,histogram) on a,b from t;

vs

create statistic a_stats1 (mcv) on a,b from t;
create statistic a_stats2 (histogram) on a,b from t;

Is that going to be surprising to people?

4. I guess you can replace "(histogram == NULL);" with "false". The
compiler would likely do it anyway, but...

if (histogram != NULL)
{
/* histogram already is a bytea value, not need to serialize */
nulls[Anum_pg_statistic_ext_stxhistogram - 1] = (histogram == NULL);
values[Anum_pg_statistic_ext_stxhistogram - 1] = PointerGetDatum(histogram);
}

but, hmm. Shouldn't you serialize this, like you are with the others?

5. serialize_histogram() and statext_histogram_deserialize(), should
these follow the same function naming format?

6. IIRC some compilers may warn about this:

if (stat->kinds & requiredkinds)

making it:

if ((stat->kinds & requiredkinds))

should fix that.

UPDATE: Tried to make a few compilers warn about this and failed.
Perhaps I've misremembered.

7. Comment claims function has a parameter named 'requiredkind', but
it no longer does. The comment also needs updated to mention that it
finds statistics with any of the required kinds.

* choose_best_statistics
* Look for and return statistics with the specified 'requiredkind' which
* have keys that match at least two of the given attnums. Return NULL if
* there's no match.
*
* The current selection criteria is very simple - we choose the statistics
* object referencing the most of the requested attributes, breaking ties
* in favor of objects with fewer keys overall.
*
* XXX If multiple statistics objects tie on both criteria, then which object
* is chosen depends on the order that they appear in the stats list. Perhaps
* further tiebreakers are needed.
*/
StatisticExtInfo *
choose_best_statistics(List *stats, Bitmapset *attnums, int requiredkinds)

8. Looking at statext_clauselist_selectivity() I see it calls
choose_best_statistics() passing requiredkinds as STATS_EXT_INFO_MCV |
STATS_EXT_INFO_HISTOGRAM, do you think the function now needs to
attempt to find the best match plus the one with the most statistics
kinds?

It might only matter if someone had:

create statistic a_stats1 (mcv) on a,b from t;
create statistic a_stats2 (histogram) on a,b from t;
create statistic a_stats3 (mcv,histogram) on a,b from t;

Is it fine to just return a_stats1 and ignore the fact that a_stats3
is probably better? Or too corner case to care?

9. examine_equality_clause() assumes it'll get a Var. I see we should
only allow clauses that pass statext_is_compatible_clause_internal(),
so maybe it's worth an Assert(IsA(var, Var)) along with a comment to
mention anything else could not have been allowed.

10. Does examine_equality_clause need 'root' as an argument?

11. UINT16_MAX -> PG_UINT16_MAX

/* make sure we fit into uint16 */
Assert(count <= UINT16_MAX);

(Out of energy for today.)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Surafel Temesgen 2019-01-17 12:14:53 Re: pg_dump multi VALUES INSERT
Previous Message Laurenz Albe 2019-01-17 10:59:18 Re: Libpq support to connect to standby server as priority