|From:||Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>|
|To:||Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>|
|Cc:||Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: [HACKERS] PATCH: multivariate histograms and MCV lists|
|Views:||Raw Message | Whole Thread | Download mbox|
Attached is a rebased version of this patch series, mostly just fixing
the breakage caused by reworked format of initial catalog data.
Aside from that, the MCV building now adopts the logic introduced by
commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
pretty good and I don't see why multi-column MCV lists should use
I'm sure there are plenty of open questions to discuss, particularly
stuff related to combining the various types of statistics to the final
estimate (a lot of that was already improved based on Dean's reviews).
On thing that occurred to me while comparing the single-column logic (as
implemented in selfuncs.c) and the new multi-column stuff, is dealing
with partially-matching histogram buckets.
In the single-column case, we pretty much assume uniform distribution in
each bucket, and linearly interpolate the selectivity. So for a bucket
with boundaries [0, 10] and condition "x <= 5" we return 0.5, for "x <
7" we return 0.7 and so on. This is what convert_to_scalar() does.
In the multi-column case, we simply count each matching bucket as 0.5,
without any attempts to linearly interpolate. It would not be difficult
to call "convert_to_scalar" for each condition (essentially repeating
the linear interpolation for each column), but then what? We could
simply compute a product of those results, of course, but that only
works assuming independence. And that's exactly the wrong thing to
assume here, considering the extended statistics are meant for cases
where the columns are not independent.
So I'd argue the 0.5 estimate for partially-matching buckets is the
right thing to do here, as it's minimizing the average error.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
|Next Message||Fabien COELHO||2018-06-24 19:59:52||Re: Desirability of client-side expressions in psql?|
|Previous Message||Peter Geoghegan||2018-06-24 19:28:01||Removing obsolete comment block at the top of nbtsort.c.|