Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-22 04:10:00
Message-ID: CAKJS1f9jKkjs5NB3MwOMYsGzWCX6ptsOX2NDHMhJ7oJfbfxEqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for making those changes.

On Fri, 18 Jan 2019 at 10:03, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> A couple of items remains:
>
> > 15. I see you broke out the remainder of the code from
> > clauselist_selectivity() into clauselist_selectivity_simple(). The
> > comment looks like just a copy and paste from the original. That
> > seems like quite a bit of duplication. Is it better to maybe trim down
> > the original one?
>
> I don't follow - where do you see the code duplication? Essentially, we
> have clauselist_selectivity and clauselist_selectivity_simple, but the
> first one calls the second one. The "simple" version is needed because
> in some cases we need to perform estimation without multivariate stats
> (e.g. to prevent infinite loop due to recursion).

It was the comment duplication that I was complaining about. I think
clauselist_selectivity()'s comment can be simplified to mention it
attempts to apply extended statistics and applies
clauselist_selectivity_simple on any stats that remain. Plus any
details that are specific to extended statistics. That way if anyone
wants further detail on what happens to the remaining clauses they can
look at the comment above clauselist_selectivity_simple().

> > 18. In dependencies_clauselist_selectivity() there seem to be a new
> > bug introduced. We do:
> >
> > /* mark this one as done, so we don't touch it again. */
> > *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
> >
> > but the bms_is_member() check that skipped these has been removed.
> >
> > It might be easier to document if we just always do:
> >
> > if (bms_is_member(listidx, *estimatedclauses))
> > continue;
> >
> > at the start of both loops. list_attnums can just be left unset for
> > the originally already estimatedclauses.
>
> This was already discussed - I don't think there's any bug, but I'll
> look into refactoring the code somehow to make it clear.

On looking at this a bit more it seems that since the estimated attr
is removed from the clauses_attnums Bitmapset that
find_strongest_dependency() will no longer find a dependency for that
clause and dependency_implies_attribute() will just return false where
the bms_is_member(listidx, *estimatedclauses) would have done
previously. I'll mean we could get more calls of
dependency_implies_attribute(), but that function is even cheaper than
bms_is_member() so I guess there's no harm in this change.

> > 25. Does statext_is_compatible_clause_internal)_ need to skip over
> > RelabelTypes?
>
> I don't think it should, because while RelabelType nodes represent casts
> to binary-compatible types, there's no guarantee the semantics actually
> is compatible.

The code that looks through RelabelTypes for normal stats is in
examine_variable(). This code allows the following to estimate 4 rows.
I guess if we didn't use that then we'd just need to treat it like
some unknown expression and use DEFAULT_NUM_DISTINCT.

create table a (t varchar);
insert into a select v.v from (values('One'),('Two'),('Three')) as
v(v), generate_Series(1,4);
analyze a;
explain (summary off, timing off, analyze) select * from a where t = 'One';
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on a (cost=0.00..1.15 rows=4 width=4) (actual rows=4 loops=1)
Filter: ((t)::text = 'One'::text)
Rows Removed by Filter: 8
(3 rows)

Why do you think its okay for the normal stats to look through
RelabelTypes but not the new code you're adding?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-01-22 04:29:43 Re: problems with foreign keys on partitioned tables
Previous Message Hubert Zhang 2019-01-22 04:08:46 Re: Control your disk usage in PG: Introduction to Disk Quota Extension