Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-10 13:09:13
Message-ID: CAEZATCVG+xCQc8NGx_LCY8xr-RcwaxQLfibkMnTT5A60DVturg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 9 Mar 2019 at 18:33, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>
> On Thu, 28 Feb 2019 at 19:56, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > Attached is an updated version of this patch series.
>
> Here are some random review comments. I'll add more later, but I'm out
> of energy for today.
>

Here are some more comments:

11). In dependency_degree():

- /* sort the items so that we can detect the groups */
- qsort_arg((void *) items, numrows, sizeof(SortItem),
- multi_sort_compare, mss);
+ /*
+ * build an array of SortItem(s) sorted using the multi-sort support
+ *
+ * XXX This relies on all stats entries pointing to the same tuple
+ * descriptor. Not sure if that might not be the case.
+ */
+ items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
+ mss, k, attnums_dep);

That XXX comment puzzled me for a while. Actually it's OK though,
unless/until we try to support stats across multiple relations, which
will require a much larger refactoring of this code. For now though,
The stats entries all point to the same tuple descriptor from the
onerel passed to BuildRelationExtStatistics(), so it's OK to just use
the first tuple descriptor in this way. The comment should be updated
to explain that.

12). bms_member_index() should surely be in bitmapset.c. It could be
more efficient by just traversing the bitmap words and making use of
bmw_popcount(). Also, its second argument should be of type 'int' for
consistency with other bms_* functions.

13). estimate_ndistinct() has been moved from mvdistinct.c to
extended_stats.c and changed from static to extern, but it is only
called from mvdistinct.c, so that change is unnecessary (at least as
far as this patch is concerned).

14). The attnums Bitmapset passed to
statext_is_compatible_clause_internal() is an input/output argument
that it updates. That should probably be documented. When it calls
itself recursively for AND/OR/NOT clauses, it could just pass the
original Bitmapset through to be updated (rather than creating a new
one and merging it), as it does for other types of clause.

On the other hand, the outer function statext_is_compatible_clause()
does need to return a new bitmap, which may or may not be used by its
caller, so it would be cleaner to make that a strictly "out" parameter
and initialise it to NULL in that function, rather than in its caller.

15). As I said yesterday, I don't think that there is a clean
separator of concerns between the functions clauselist_selectivity(),
statext_clauselist_selectivity(),
dependencies_clauselist_selectivity() and
mcv_clauselist_selectivity(), I think things could be re-arranged as
follows:

statext_clauselist_selectivity() - as the name suggests - should take
care of *all* extended stats estimation, not just MCVs and histograms.
So make it a fairly small function, calling
mcv_clauselist_selectivity() and
dependencies_clauselist_selectivity(), and histograms when that gets
added.

Most of the current code in statext_clauselist_selectivity() is really
MCV-specific, so move that to mcv_clauselist_selectivity(). Amongst
other things, that would move the call to choose_best_statistics() to
mcv_clauselist_selectivity() (just as
dependencies_clauselist_selectivity() calls choose_best_statistics()
to get the best dependencies statistics). Then, when histograms are
added later, you won't have the problem pointed out before where it
can't apply both MCV and histogram stats if they're on different
STATISTICS objects.

Most of the comments for statext_clauselist_selectivity() are also
MCV-related. Those too would move to mcv_clauselist_selectivity().

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2019-03-10 13:10:43 Re: Should we increase the default vacuum_cost_limit?
Previous Message Julien Rouhaud 2019-03-10 12:13:50 Re: Checksum errors in pg_stat_database