Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2018-03-26 13:08:43
Message-ID: 387341b0-e4bb-e27d-8c68-afaef8189c15@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/26/2018 12:31 PM, Dean Rasheed wrote:
> On 18 March 2018 at 23:57, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Attached is an updated version of the patch series, addressing issues
>> pointed out by Alvaro.
>
> I'm just starting to look at this now, and I think I'll post
> individual comments/questions as I get to them rather than trying to
> review the whole thing, because it's quite a large patch. Apologies
> if some of this has already been discussed.
>

Sure, works for me. And thanks for looking!

> Looking at the changes to UpdateStatisticsForTypeChange():
>
> + memset(nulls, 1, Natts_pg_statistic_ext * sizeof(bool));
>
> why the "1" there -- is it just a typo?
>

Yeah, that should be 0. It's not causing any issues, because the
"replaces" array is initialized to 0 so we're not really using the null
value except for individual entries like here:

if (statext_is_kind_built(oldtup, STATS_EXT_MCV))
{
replaces[Anum_pg_statistic_ext_stxmcv - 1] = true;
nulls[Anum_pg_statistic_ext_stxmcv - 1] = true;
}

but that sets the "nulls" to true anyway.

> A wider concern I have is that I think this function is trying to be
> too clever by only resetting selected stats. IMO it should just reset
> all stats unconditionally when the column type changes, which would
> be consistent with what we do for regular stats.
>
> Consider, for example, what would happen if a column was changed from
> real to int -- all the data values will be coerced to integers,
> losing precision, and any ndistinct and dependency stats would
> likely be completely wrong afterwards. IMO that's a bug, and should
> be back-patched independently of these new types of extended stats.
>
> Thoughts?
>

The argument a year ago was that it's more plausible that the semantics
remains the same. I think the question is how the type change affects
precision - had the type change in the opposite direction (int to real)
there would be no problem, because both ndistinct and dependencies would
produce the same statistics.

In my experience people are far more likely to change data types in a
way that preserves precision, so I think the current behavior is OK.

The other reason is that when reducing precision, it generally enforces
the dependency (you can't violate functional dependencies or break
grouping by merging values). So you will have stale stats with weaker
dependencies, but it's still better than not having any.

But that's mostly unrelated to this patch, of course - for MCV lists and
histograms we can't keep the stats anyway, because the stats actually do
contain the type values (unlike stats introduced in PG10).

Actually, to be more accurate - we now store OIDs of the data types in
the MCV/histogram stats, so perhaps we could keep those too. But that
would be way more work (if at all possible).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-03-26 13:20:09 Re: Parallel Aggregates for string_agg and array_agg
Previous Message Damir Simunic 2018-03-26 13:05:25 Re: Proposal: http2 wire format