Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date:	2018-06-24 19:45:19
Message-ID:	f52637bd-f761-f91b-c8ac-85e0c159807f@2ndquadrant.com
Lists:	pgsql-hackers

Hi all,

Attached is a rebased version of this patch series, mostly just fixing
the breakage caused by the reworked format of initial catalog data.

Aside from that, the MCV building now adopts the logic introduced by
commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
pretty good and I don't see why multi-column MCV lists should use
something special.

I'm sure there are plenty of open questions to discuss, particularly
stuff related to combining the various types of statistics into the final
estimate (a lot of that was already improved based on Dean's reviews).

One thing that occurred to me while comparing the single-column logic (as
implemented in selfuncs.c) and the new multi-column stuff, is dealing
with partially-matching histogram buckets.

In the single-column case, we pretty much assume uniform distribution in
each bucket, and linearly interpolate the selectivity. So for a bucket
with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x <
7" we return 0.7 and so on. This is what convert_to_scalar() does.
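
To make the arithmetic concrete, here is a minimal Python sketch of linear interpolation within a single bucket under the uniformity assumption. The function name bucket_fraction is hypothetical and this is not the selfuncs.c code, just the same calculation on plain numbers.

```python
def bucket_fraction(lo, hi, value):
    """Fraction of the bucket [lo, hi] at or below `value`,
    assuming values are uniformly distributed inside the bucket.
    (Hypothetical helper, for illustration only.)"""
    if hi <= lo:
        # Degenerate bucket: no width to interpolate over.
        return 0.5
    frac = (value - lo) / (hi - lo)
    # Clamp so values outside the bucket give 0 or 1.
    return min(1.0, max(0.0, frac))


print(bucket_fraction(0, 10, 5))
print(bucket_fraction(0, 10, 7))
```

For a continuous distribution "x < 7" and "x <= 7" give the same fraction, which is why the sketch does not distinguish them.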

In the multi-column case, we simply count each partially-matching bucket
as 0.5, without any attempt to linearly interpolate. It would not be difficult
to call "convert_to_scalar" for each condition (essentially repeating
the linear interpolation for each column), but then what? We could
simply compute a product of those results, of course, but that only
works assuming independence. And that's exactly the wrong thing to
assume here, considering the extended statistics are meant for cases
where the columns are not independent.
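
A small Python sketch of why the product is problematic, using made-up data inside one two-dimensional bucket [0, 10] x [0, 10] and the conditions "a <= 5 AND b <= 5". The helper names and the synthetic rows are assumptions for illustration, not anything from the patch.

```python
def product_estimate(bounds, conds):
    """Per-column linear interpolation, multiplied together
    (i.e. assuming the columns are independent)."""
    est = 1.0
    for (lo, hi), v in zip(bounds, conds):
        est *= min(1.0, max(0.0, (v - lo) / (hi - lo)))
    return est


def actual_fraction(rows, conds):
    """Fraction of rows satisfying col <= value for every column."""
    hits = sum(1 for row in rows if all(x <= v for x, v in zip(row, conds)))
    return hits / len(rows)


bounds = [(0, 10), (0, 10)]
conds = (5, 5)

# Synthetic data: b == a (perfectly correlated) and b == 10 - a
# (perfectly anti-correlated), 1001 rows each.
correlated = [(i / 100, i / 100) for i in range(1001)]
anti_correlated = [(i / 100, 10 - i / 100) for i in range(1001)]

print(product_estimate(bounds, conds))          # same for both data sets
print(actual_fraction(correlated, conds))       # close to 0.5
print(actual_fraction(anti_correlated, conds))  # close to 0.0
```

The product is identical for both data sets because it only sees the bucket boundaries, while the true fraction ranges from roughly 0 to roughly 0.5 depending on how the columns are related.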

So I'd argue the 0.5 estimate for partially-matching buckets is the
right thing to do here, as it's minimizing the average error.
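
For illustration, the 0.5 rule can be sketched in Python as follows. The bucket layout (a dict with "bounds" and "freq") and the function names are hypothetical and only handle conditions of the form "col <= value"; the patch's actual data structures differ.

```python
def classify(bucket, conds):
    """Return 'full', 'partial' or 'none' for a bucket, given one
    'col <= value' condition per dimension."""
    full = True
    for (lo, hi), v in zip(bucket["bounds"], conds):
        if v < lo:
            return "none"      # bucket lies entirely above the condition
        if v < hi:
            full = False       # condition cuts through this dimension
    return "full" if full else "partial"


def histogram_selectivity(buckets, conds):
    """Sum bucket frequencies: fully matching buckets count whole,
    partially matching buckets count as half."""
    sel = 0.0
    for b in buckets:
        kind = classify(b, conds)
        if kind == "full":
            sel += b["freq"]
        elif kind == "partial":
            sel += 0.5 * b["freq"]
    return sel


# Four equally frequent buckets tiling [0, 10] x [0, 10].
buckets = [
    {"bounds": [(0, 5), (0, 5)], "freq": 0.25},
    {"bounds": [(5, 10), (0, 5)], "freq": 0.25},
    {"bounds": [(0, 5), (5, 10)], "freq": 0.25},
    {"bounds": [(5, 10), (5, 10)], "freq": 0.25},
]

print(histogram_selectivity(buckets, (5, 7)))
```

With the conditions "a <= 5 AND b <= 7", one bucket matches fully and three match partially, so the sketch yields 0.25 + 3 * 0.125 = 0.625.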

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment	Content-Type	Size
0001-multivariate-MCV-lists.patch.gz	application/gzip	39.9 KB
0002-multivariate-histograms.patch.gz	application/gzip	43.0 KB

In response to

Re: [HACKERS] PATCH: multivariate histograms and MCV lists at 2018-04-07 20:44:42 from Tomas Vondra

Responses

Re: [HACKERS] PATCH: multivariate histograms and MCV lists at 2018-07-13 11:19:18 from Dean Rasheed
