Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-10 17:36:55
Message-ID: 5c865c03-9963-c62a-ef0c-c42f3c0e855f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Dean,

Thanks for the review. I'll post a patch fixing most of the stuff soon,
but a few comments/questions regarding some of the issues:

On 3/9/19 7:33 PM, Dean Rasheed wrote:
> 5). It's not obvious what some of the new test cases in the
> "stats_ext" tests are intended to show. For example, the first test
> creates a table with 5000 rows and a couple of indexes, does a couple
> of queries, builds some MCV stats, and then repeats the queries, but
> the results seem to be the same with and without the stats.
>

Hmmm. I thought those tests are testing that we get the right plan, but
maybe I broke that somehow during the rebases. Will check.

> I wonder if it's possible to write smaller, more targeted tests.
> Currently "stats_ext" is by far the slowest test in its group, and I'm
> not sure that some of those tests add much. It ought to be possible to
> write a function that calls EXPLAIN and returns a query's row
> estimate, and then you could write tests to confirm the effect of the
> new stats by verifying the row estimates change as expected.

Sure, if we can write more targeted tests, that would be good. But it's
not quite clear to me how wrapping EXPLAIN in a function makes those
tests any faster?

On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 12). bms_member_index() should surely be in bitmapset.c. It could be
> more efficient by just traversing the bitmap words and making use of
> bmw_popcount(). Also, its second argument should be of type 'int' for
> consistency with other bms_* functions.

Yes, moving to bitmapset.c definitely makes sense. I don't see how it
could use bms_popcount() though.

On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 14). The attnums Bitmapset passed to
> statext_is_compatible_clause_internal() is an input/output argument
> that it updates. That should probably be documented. When it calls
> itself recursively for AND/OR/NOT clauses, it could just pass the
> original Bitmapset through to be updated (rather than creating a new
> one and merging it), as it does for other types of clause.
>

I don't think it's really possible, because the AND/OR/NOT clause is
considered compatible only when all the pieces are compatible. So we
can't tweak the original bitmapset directly in case the incompatible
clause is not the very first one.

> On the other hand, the outer function statext_is_compatible_clause()
> does need to return a new bitmap, which may or may not be used by its
> caller, so it would be cleaner to make that a strictly "out" parameter
> and initialise it to NULL in that function, rather than in its caller.

On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 15). As I said yesterday, I don't think that there is a clean
> separator of concerns between the functions clauselist_selectivity(),
> statext_clauselist_selectivity(),
> dependencies_clauselist_selectivity() and
> mcv_clauselist_selectivity(), I think things could be re-arranged as
> follows:
>
> statext_clauselist_selectivity() - as the name suggests - should take
> care of *all* extended stats estimation, not just MCVs and histograms.
> So make it a fairly small function, calling
> mcv_clauselist_selectivity() and
> dependencies_clauselist_selectivity(), and histograms when that gets
> added.
>
> Most of the current code in statext_clauselist_selectivity() is really
> MCV-specific, so move that to mcv_clauselist_selectivity(). Amongst
> other things, that would move the call to choose_best_statistics() to
> mcv_clauselist_selectivity() (just as
> dependencies_clauselist_selectivity() calls choose_best_statistics()
> to get the best dependencies statistics). Then, when histograms are
> added later, you won't have the problem pointed out before where it
> can't apply both MCV and histogram stats if they're on different
> STATISTICS objects.

I agree clauselist_selectivity() shouldn't care about various types of
extended statistics (MCV vs. functional dependencies). But I'm not sure
the approach you suggested (moving stuff to mcv_clauselist_selectivity)
would work particularly well because most of it is not specific to MCV
lists. It'll also need to care about histograms, for example.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2019-03-10 17:43:24 Re: monitoring CREATE INDEX [CONCURRENTLY]
Previous Message John Naylor 2019-03-10 17:11:18 Re: Why don't we have a small reserved OID range for patch revisions?