Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-25 23:36:25
Message-ID: add2f0d3-4fbc-af7c-e02e-0f2b2d6dedc3@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is an updated patch, fixing all the issues pointed out so far.
Unless there are some objections, I plan to commit the 0001 part by the
end of this CF. Part 0002 is a matter for PG13, as previously agreed.

On 3/24/19 1:17 AM, David Rowley wrote:
> On Sun, 24 Mar 2019 at 12:41, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On 3/21/19 4:05 PM, David Rowley wrote:
>>> 11. In get_mincount_for_mcv_list() it's probably better to have the
>>> numerical literals of 0.0 instead of just 0.
>>
>> Why?
>
> Isn't it what we do for float and double literals?
>

OK, fixed.

>>
>>> 12. I think it would be better if you modified build_attnums_array()
>>> to add an output argument that sets the size of the array. It seems
>>> that most places you call this function you perform bms_num_members()
>>> to determine the array size.
>>
>> Hmmm. I've done this, but I'm not sure I like it very much - there's no
>> protection the value passed in is the right one, so the array might be
>> allocated either too small or too large. I think it might be better to
>> make it work the other way, i.e. pass the value out instead.
>
> When I said "that sets the size", I meant "that gets set to the size",
> sorry for the confusion. I mean, if you're doing bms_num_members()
> inside build_attnums_array() anyway, then this will save you from
> having to do that in the callers too.
>

OK, I've done this now, and I'm fairly happy with it.

>>> 28. Can you explain what this is?
>>>
>>> uint32 type; /* type of MCV list (BASIC) */
>>>
>>> I see: #define STATS_MCV_TYPE_BASIC 1 /* basic MCV list type */
>>>
>>> but it's not really clear to me what else could exist. Maybe the
>>> "type" comment can explain there's only one type for now, but more
>>> might exist in the future?
>>>
>>
>> It's the same idea as for dependencies/mvdistinct stats, i.e.
>> essentially a version number for the data structure so that we can
>> perhaps introduce some improved version of the data structure in the future.
>>
>> But now that I think about it, it seems a bit pointless. We would only
>> do that in a major version anyway, and we don't keep statistics during
>> upgrades. So we could just as well introduce the version/flag/... if
>> needed. We can't do this for regular persistent data, but for stats it
>> does not matter.
>>
>> So I propose we just remove this thingy from both the existing stats and
>> this patch.
>
> I see. I wasn't aware that existed for the other types. It certainly
> gives some wiggle room if some mistakes were discovered after the
> release, but thinking about it, we could probably just change the
> "magic" number and add new code in that branch only to ignore the old
> magic number, perhaps with a WARNING to analyze the table again. The
> magic field seems sufficiently early in the struct that we could do
> that. In the master branch we'd just error if the magic number didn't
> match, since we wouldn't have to deal with stats generated by the
> previous version's bug.
>

OK. I've decided to keep the field for now, for sake of consistency with
the already existing statistic types. I think we can rethink that in the
future, if needed.

>>> 29. Looking at the tests I see you're testing that you get bad
>>> estimates without extended stats. That does not really seem like
>>> something that should be done in tests that are meant for extended
>>> statistics.
>>>
>>
>> True, it might be a bit unnecessary. Initially the tests were meant to
>> show old/new estimates for development purposes, but it might not be
>> appropriate for regression tests. I don't think it's a big issue, it's
>> not like it'd slow down the tests significantly. Opinions?
>
> My thoughts were that if someone did something to improve non-MV
> stats, then is it right for these tests to fail? What should the
> developer do in the case? update the expected result? remove the test?
> It's not so clear.
>

I think such changes would affect a number of other places in regression
tests (changing plans, ...), so I don't see why fixing these tests would
be any different.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-multivariate-MCV-lists-20190325.patch.gz application/gzip 39.9 KB
0002-multivariate-histograms-20190325.patch.gz application/gzip 48.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-03-25 23:39:55 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message David Rowley 2019-03-25 23:29:49 Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath