From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |

Cc: | Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |

Subject: | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |

Date: | 2018-09-02 23:17:15 |

Message-ID: | 8ac8bd94-478d-215d-e6bd-339f1f20a74c@2ndquadrant.com |

Views: | Raw Message | Whole Thread | Download mbox |

Thread: | |

Lists: | pgsql-hackers |

Hi,

Attached is an updated version of the patch series, adopting a couple of

improvements - both for MCV lists and histograms.

MCV

---

For the MCV list part, I've adopted the approach proposed by Dean, using

base selectivity and using it to correct the non-MCV part. I agree the

simplicity of the approach is a nice feature, and it seems to produce

better estimates. I'm not sure I understand the approach perfectly, but

I've tried to add comments explaining how it works etc.

I've also changed how we build the MCV lists, particularly how we decide

how many / which items store in the MCV list. In the previous version

I've adopted the same algorithm we use for per-column MCV lists, but in

certain cases that turned out to be too restrictive.

Consider for example a table with multiple perfectly correlated columns,

with very few combinations. That is, something like this:

CREATE TABLE t (a int, b int);

INSERT INTO t SELECT mod(i,50), mod(i,50)

FROM generate_series(1,1e6) s(i);

CREATE STATISTICS s (mcv) ON a,b FROM t;

Now, the data distribution is very simple - uniform, with 50 distinct

combinations, each representing 2% of data (and the random sample should

be pretty close to that).

In these cases, analyze_mcv_list decides it does not need any MCV list,

because the frequency for each value is pretty much 1/ndistinct. For

single column that's reasonable, but for multiple correlated columns

it's rather problematic. We might use the same ndistinct approach

(assuming we have the ndistinct coefficients), but that still does not

allow us to decide which combinations are "valid" with respect to the

data. For example we can't decide (1,10) does not appear in the data.

So I'm not entirely sure adopting the same algorithm analyze_mcv_list

algorithm both for single-column and multi-column stats. It may make

sense to keep more items in the multi-column case for reasons that are

not really valid for a single single-column.

For now I've added a trivial condition to simply keep all the groups

when possible. This probably needs more thought.

BTW Dean's patch also modified how the maximum number of items on a MCV

list is determined - instead of the shaky defaults I used before, it

derives the size from attstattarget values for the columns, keeping the

maximum value. That seems to make sense, so I've kept this.

histograms

----------

For histograms, I've made the two improvements I mentioned previously.

Firstly, simple equality conditions (of the form "var = const") are

estimated using as 1/ndistinct (possibly using ndistinct coefficients

when available), and then used only as "conditions" (in the "conditional

probability" sense) when estimating the rest of the clauses using the

histogram.

That is P(clauses) is split into two parts

P(clauses) = P(equalities) * P(remaining|clauses)

where the first part is estimated as 1/ndistinct, the second part is

estimated using histogram.

I'm sure this needs more thought, particularly when combining MCV and

histogram estimates. But in general it seems to work quite nicely.

The second improvement is about estimating what fraction of a bucket

matches the conditions. Instead of using the rough 1/2-bucket estimate,

I've adopted the convert_to_scalar approach, computing a geometric mean

for all the clauses (at a bucket level).

I'm not entirely sure the geometric mean is the right approach (or

better than simply using 1/2 the bucket) because multiplying the

per-clause frequencies is mostly equal to assumption of independence at

the bucket level. Which is rather incompatible with the purpose of

multi-column statistics, which are meant to be used exactly when the

columns are not independent.

measurements

------------

I think we need to maintain a set of tests (dataset + query), so that we

can compare impact of various changes in the algorithm. So far we've

used mostly ad-hoc queries, often created as counter-examples, and that

does not seem very practical.

So I'm attaching a simple SQL script that I consider an initial version

of that. It has a couple of synthetic data sets, and queries estimated

with and without extended statistics.

I'm also attaching a spreadsheet with results for (a) the original

version of the patch series, as submitted on 6/24, (b) the new version

attached here and (c) the new version using the per-bucket estimates

directly, without the geometric mean.

Overall, the new versions seem to perform better than the version from

6/24, and also compared to only per-column statistics. There are cases

where extended statistic produce over-estimates, but I find it somewhat

natural due to lower resolution of the multi-column stats.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment | Content-Type | Size |
---|---|---|

stats.ods | application/vnd.oasis.opendocument.spreadsheet | 30.1 KB |

stats-tests.sql | application/sql | 17.7 KB |

0002-multivariate-histograms-20180902.patch | text/x-patch | 175.1 KB |

0001-multivariate-MCV-lists-20180902.patch | text/x-patch | 148.0 KB |

- Re: [HACKERS] PATCH: multivariate histograms and MCV lists at 2018-08-06 14:34:10 from Tomas Vondra

- Re: [HACKERS] PATCH: multivariate histograms and MCV lists at 2018-09-04 14:16:28 from Dean Rasheed
- Re: [HACKERS] PATCH: multivariate histograms and MCV lists at 2018-11-26 22:29:17 from Thomas Munro

From | Date | Subject | |
---|---|---|---|

Next Message | Tomas Vondra | 2018-09-03 00:35:00 | Re: memory leak when serializing TRUNCATE in reorderbuffer |

Previous Message | Tom Lane | 2018-09-02 22:51:52 | Re: libpq debug log |