From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-16 15:26:51
Message-ID: 8a524773-3299-fe7b-de8b-bc977de97bec@2ndquadrant.com
Lists: pgsql-hackers

On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
>

Thanks for working on this. The code seems fine to me, although it might
be a good idea to add a comment explaining why analyze_mcv_list() starts
with the full MCV list and then removes items (essentially an explanation
of why Jeff's original idea would not work so well).

Actually, one question - when deciding whether to keep an item in the
MCV list, analyze_mcv_list() only compares its frequency with the average
frequency of the rest. But as we're removing items from the MCV list, the
average frequency of the non-MCV items grows (we're removing items with
higher and higher frequencies). That means the estimates for the least
common items will get higher and higher - essentially overestimates.
So couldn't/shouldn't analyze_mcv_list() take this into account too?
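
To illustrate with made-up numbers: say the full MCV list has 100 items
covering 50% of a table with 1000 distinct values, so the 900 non-MCV
values share the remaining 50% and each gets estimated at roughly
0.5/900 ~ 0.00056. If we then prune the 50 least common MCV items, each
with frequency about 0.004, the non-MCV average changes like this:

before pruning: 0.5 / 900 = 0.00056
after pruning: (0.5 + 50 * 0.004) / (900 + 50) = 0.7 / 950 = 0.00074

i.e. the estimate for the least common values grows by about a third.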

I've also done a bunch of testing regarding join cardinality estimates,
because eqjoinsel_inner() is one of the places using MCV lists too. So
I've generated tables with different sizes and data distributions, and
observed how the patch affects the join estimates.
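
Roughly speaking, each combination boils down to comparing the estimated
and actual row counts for a simple equi-join, something like this (the
column names here are made up - the actual queries are in the scripts
linked below):

explain analyze
select * from t1 join t2 on (t1.val = t2.val);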

The scripts and results are available here:

https://github.com/tvondra/join-estimates-tests

The tables were not particularly huge at this point (1000 to 100k rows).
I've also varied the number of distinct values (100 - 10k), the statistics
target (10 - 10k, set per column as sketched after the list) and the data
distribution (for each table independently):

1) uniform

insert into t1 select random() * $ndistinct1, i
from generate_series(1,$nrows1) s(i)

2) skewed

insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)

3) skewed-inverse

insert into t2 select pow(random(),2) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)

4) skewed-strong

insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)

5) skewed-strong-inverse

insert into t2 select pow(random(),4) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)
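
The statistics target was set per column, along the lines of (again, the
column name is made up):

alter table t1 alter column val set statistics 100;
analyze t1;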

The results are a bit too large to attach here - see for example
https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods.

Looking at Mean Relative Error, i.e. essentially

MRE = AVERAGE(MAX(estimate/actual,actual/estimate))

over the 20 runs for each combination of parameters, the "average" sheet
shows

(MRE for patched) / (MRE for master)

and the patch seems to clearly improve accuracy in this case. There are
only a few cases where the estimate gets slightly worse (say, by 10%)
compared to current master. So overall I think that's a nice improvement.
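
For the record, given a table of per-run (estimate, actual) pairs, that
MRE is just (hypothetical table and column names):

-- the casts avoid integer division
select avg(greatest(estimate::numeric / actual,
                    actual::numeric / estimate)) as mre
from results;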

I'm open to running additional tests with other distributions, table
sizes etc. if needed.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
