Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Date: 2021-08-18 10:43:05
Message-ID: 69ce226b-cfbb-2712-bfd7-ec62cf0e0b2a@enterprisedb.com
Lists: pgsql-hackers

Hi Mark,

This thread inspired me to do something fairly similar - a generator
that builds queries of varying complexity and executes them on a table
with and without extended statistics. I'd been thinking about doing
that for a while, but this finally pushed me to do it, and some of the
results are fairly interesting ...

I've pushed everything (generator and results) to this github repo:

https://github.com/tvondra/stats-test

with a summary of all results here:

https://github.com/tvondra/stats-test/blob/master/results.md

Some basic facts about the generator.py (see query_generator):

* It's using a fixed seed to make it deterministic.

* A small fraction (5%) of the generated queries is sampled and executed.

* Thanks to the fixed seed we generate/sample the same set of queries
across different runs, which allows us to compare runs easily.

* The queries use 2 - 5 clauses, either (Var op Const) or (Var op Var).

* The operators are the usual equality/inequality ones.

* The clauses are combined using AND/OR (also randomly picked).

* A random set of parentheses is added to vary the grouping (otherwise
it'd be driven entirely by AND/OR precedence).

* There are two datasets - a random one and a correlated one, with
different numbers of distinct values in each column (10, 100, 1000, 10000).

* The statistics target is set to 10, 100, 1000, 10000.

It's a bit hacky, with various bits hard-coded at the moment. But it
could be extended to do other stuff fairly easily, I think.
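For illustration, here's a minimal sketch of what such a generator might
look like (the real logic lives in generator.py in the repo; the column
names, value ranges, and probabilities below are made up, not taken from
the actual script):

```python
import random

COLUMNS = ["a", "b", "c", "d"]                # hypothetical column names
OPERATORS = ["=", "<>", "<", "<=", ">", ">="]

def random_clause(rng):
    """Build either a (Var op Const) or a (Var op Var) clause."""
    left = rng.choice(COLUMNS)
    op = rng.choice(OPERATORS)
    if rng.random() < 0.5:
        right = str(rng.randint(0, 100))      # Var op Const
    else:
        right = rng.choice(COLUMNS)           # Var op Var
    return f"({left} {op} {right})"

def random_query(rng, table="t"):
    """Combine 2-5 clauses with randomly picked AND/OR connectives."""
    n = rng.randint(2, 5)
    where = random_clause(rng)
    for _ in range(n - 1):
        conj = rng.choice(["AND", "OR"])
        clause = random_clause(rng)
        # occasionally parenthesize the prefix to vary the grouping
        if rng.random() < 0.3:
            where = f"({where} {conj} {clause})"
        else:
            where = f"{where} {conj} {clause}"
    return f"SELECT * FROM {table} WHERE {where}"

rng = random.Random(42)   # fixed seed => deterministic query set
print(random_query(rng))
```

Because the seed is fixed, re-running the script yields the exact same
query set, which is what makes run-to-run comparisons meaningful.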

Anyway, the repository contains results for three cases:

1) master
2) patched: master with the (Var op Var) patch
3) fixed: patched, with a fix for "simple" clauses (a crude patch)

And for each case we have three row counts:

* actual (from explain analyze)
* estimate without extended stats
* estimate with extended stats
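For context, both row counts can be scraped from the text output of
EXPLAIN ANALYZE - the planner's estimate from the "rows=" in the cost
section, the actual count from the "rows=" in the "actual" section. A
rough sketch (the sample plan line below is illustrative, not taken from
this thread):

```python
import re

# estimate comes from "(cost=... rows=N ...)", actual from "(actual ... rows=M ...)"
PLAN_RE = re.compile(r"rows=(\d+).*actual[^)]*rows=(\d+)")

def rows_from_plan(top_line):
    """Extract (estimated, actual) row counts from an EXPLAIN ANALYZE line."""
    m = PLAN_RE.search(top_line)
    if not m:
        raise ValueError("not an EXPLAIN ANALYZE plan line")
    return int(m.group(1)), int(m.group(2))

line = ("Seq Scan on t  (cost=0.00..35.50 rows=120 width=8) "
        "(actual time=0.010..0.210 rows=95 loops=1)")
print(rows_from_plan(line))  # (120, 95)
```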

And then we can calculate "estimation error" as

estimate / actual

both with and without statistics. Results for two cases can be plotted
as a scatter plot, with the two estimation errors as (x,y) values. The
idea is that this shows how a patch affects estimates - a point (100,10)
means that it was 100x over-estimated, and with the patch it's just 10x,
and similarly for other points.
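A sketch of that metric, with a guard for zero actual rows (an
assumption on my part, not necessarily how the repo handles it):

```python
def estimation_error(estimate, actual):
    """Ratio estimate/actual: 1.0 is perfect, >1 over-, <1 under-estimate."""
    return estimate / max(actual, 1)   # avoid division by zero

# e.g. a query over-estimated 100x on master but only 10x with the patch
point = (estimation_error(10000, 100), estimation_error(1000, 100))
print(point)  # (100.0, 10.0)
```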

This is what the charts at

https://github.com/tvondra/stats-test/blob/master/results.md

do, for each combination of parameters (dataset, statistics target and
number of distinct values). There's one chart without extended stats,
one with extended stats.

An "ideal" chart would look like a single point (1,1), meaning "accurate
estimates both without and with the patch", or (?,1), meaning "poor
estimates before, accurate estimates now". Points on the diagonal mean
"no change".

In principle, we expect the charts to look like this:

* random: diagonal charts, because there should be no extended stats
built, hence no impact on estimates is expected

* correlated: getting closer to 1.0, which looks like a horizontal line
in the chart

Consider for example this:

https://github.com/tvondra/stats-test/raw/master/correlated-1000-10.png

which clearly shows that the first patch behaves almost exactly like
master, while with the fix the estimates improve significantly (and are
almost perfect), at least with the extended statistics.

Without stats there's a bunch of queries that suddenly go from
"perfect" to much worse (visible as a vertical line on the left chart).

But there are other "strange" cases with "interesting patterns", like
for example

*
https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-100-100.png

*
https://raw.githubusercontent.com/tvondra/stats-test/master/correlated-1000-100.png

*
https://raw.githubusercontent.com/tvondra/stats-test/master/random-10000-10.png

This likely shows the patches are a significant improvement for some
queries (either getting closer than master, or even making the estimates
pretty accurate). But it's probably worth looking into the queries that
got worse, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-patch-20210720.patch text/x-patch 25.6 KB
0002-fix-don-t-treat-Var-op-Var-as-simple-clauses.patch text/x-patch 2.1 KB
