
Re: Improving N-Distinct estimation by ANALYZE

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:07:37
Message-ID: 200601041607.37590.josh@agliodbs.com
Lists: pgsql-hackers
Simon,

> - Are there any performance issues that can be directly attributed to
> mis-estimation of N-Distinct ("D") by the ANALYZE command?

Yes.  There's at least one query (maybe two) from TPC-H which bombs 
because of bad N-distinct estimation, even with stats_target = 1000.  Based 
on my experience with data warehouses, this also occurs in the field.

> - If so, can we do better than we currently achieve? How?

Replace the current algorithm and broaden the sample.

> - Would altering the estimate of D cause problems in other places?

Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty 
"proper" use of D right now, it's just that for some common cases our 
estimates of D are bad.

> The estimation of D is difficult and imprecise. The current method works
> well in many cases, yet breaks down *badly* in one particular very
> common use case of database design: a large dependent table with a
> multi-column Primary Key. In some cases the estimate of D *decreases* as
> the size of the table increases, though the estimate of D is an
> underestimate in almost all cases, whatever the table design.

Actually, the current estimator underestimates D for *any* large table's 
high-cardinality columns, whether they are primary keys, multi-column, or 
not.  Chaudhuri's calculation seems designed to yield the smallest number 
of distinct values that could reasonably be expected to have produced the 
observed sample.  That is, if the range of estimates within 2.0 standard 
deviations runs from 50,000 to 500,000 distinct values, Chaudhuri picks 
50,000.
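
For context, the formula ANALYZE actually applies is a Haas-Stokes-style 
"Duj1" estimator.  A minimal Python sketch (my own simplification, not the 
backend code) shows how duplicate hits in a small sample drag the estimate 
of a high-cardinality column well below its true D:

```python
import random

def duj1_estimate(total_rows, sample):
    """Haas-Stokes "Duj1"-style estimator, roughly the formula ANALYZE
    uses (a sketch, not the exact backend code):
        D = n*d / (n - f1 + f1*n/N)
    n  = sample size, N = total rows,
    d  = distinct values seen in the sample,
    f1 = values appearing exactly once in the sample."""
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

# A unique column sampled at 0.5%: true D is 600,000, but the handful
# of duplicate draws in the sample pulls the estimate well below it.
random.seed(1)
N = 600_000
sample = [random.randrange(N) for _ in range(3_000)]
estimate = duj1_estimate(N, sample)
```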

This conservative approach makes sense when you're only considering join 
strategies.  That is, given an unreliable estimate, you want to estimate 
D low so that you don't wrongly choose a nested loop; the cost of that 
mistake is much higher than the cost of performing an unnecessary 
hash join.  It's "conservative" in that sense.

However, PostgreSQL now has a whole set of hash operations and other 
query types for which a too-low estimate of D causes query lockup.  For 
these operations, Chaudhuri ceases to be conservative and becomes 
high-risk.  FWIW, my testing with TPC-H showed that estimation error is 
usually OK within +/- 5x.  Beyond that you start to get bad query plans.

(Yes, I know all of the above begs for examples.  I'm swamped.  I believe 
I posted examples when I first started talking about n-distinct estimation 
a year ago.)

So I think it's vital that we look at algorithms designed to deliver the 
median estimate of D, not the lowest reasonable one, in addition to 
increasing the sample size.  The block-based estimator functions which 
Andrew and I looked at seem designed to do that, provided a sample of 
between 1% and 10% of the table.

> 1. *All* methods of statistical analysis are improved by larger sample
> fractions. The D estimator method currently in use shows an optimum of
> accuracy and sample fraction at around 5% of a table, as shown in the
> author's original paper [Haas Stokes (1998)]. The current
> implementation's error rates climb higher as table size increases.

I read 5 different papers on ACM about sampling D.  All of them were 
united in saying that you couldn't get even slightly accurate estimates 
with less than 3% sampling.

> 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> row sampling technique rather than a block sampling technique. This
> would translate directly into a performance drop from large sample
> ratios, but since we currently use a fixed sample size this problem is
> not yet visible for larger tables. With a 2GB table, we would typically
> sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

This would be a reason to use block sampling ONLY, rather than hybrid 
sampling.
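
Simon's I/O point checks out back-of-envelope: row sampling touches 
roughly one block per sampled row, so the fraction of blocks read is many 
times the fraction of rows sampled.  A sketch, where the 8 kB BLCKSZ, the 
~45 rows per block, and the 3,000-row sample (300 x the then-default stats 
target of 10) are my assumptions rather than measurements:

```python
# Back-of-envelope for the quoted 2 GB example.
BLCKSZ = 8192                           # default PostgreSQL block size
blocks = (2 * 1024**3) // BLCKSZ        # ~262,144 blocks in a 2 GB table
sample_rows = 3_000                     # 300 * default stats target of 10
rows_per_block = 45                     # assumed average row density

# Roughly one distinct block visited per sampled row:
block_fraction = sample_rows / blocks                    # ~1.1% of blocks
row_fraction = sample_rows / (blocks * rows_per_block)   # ~0.025% of rows
```

So we pay the I/O of a ~1% block scan for a row sample forty-odd times 
smaller, which is the inefficiency Simon describes.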

> 3. Large values of statistics target (up to 1000) could cause a number
> of problems with statistics catalog table growth (mentioned on
> -perform). Setting these values appropriately can take significant
> effort. Automatic scaling of such parameters is desirable.

Well, decoupling the n-distinct sample size from the number of histogram 
rows would fix that.

> 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
> nor does it use less if it is set low. Actual memory usage isn't
> measured at all. With very long rows this is max 24 MB with default
> stats target settings and BLCKSZ, or 2.4 GB with highest stats target
> (1000). This probably is of lower importance since stats targets are
> only usually set higher on larger databases, which typically have larger
> memory configurations anyway - and very long rows are uncommon because
> of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
> settings i.e. maybe as low as a 0.01% sample for a very large table.

Yeah, I think I pointed out a year ago that ANALYZE doesn't seem to be 
using maintenance_work_mem.  It should, although there should be an option 
to use less than maintenance_work_mem if the user wants a low-impact 
ANALYZE.
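
For reference, Simon's 24 MB / 2.4 GB worst-case figures fall out of the 
sample-size arithmetic; the 300-rows-per-target multiplier and 8 kB BLCKSZ 
below are my assumptions matching the defaults of the time:

```python
# Worst case assumes each sampled row is full width, i.e. up to one
# block's worth of bytes per sample slot.
BLCKSZ = 8192
default_sample = 300 * 10      # 300 rows per stats-target unit, target 10
max_sample = 300 * 1000        # highest stats target (1000)

default_worst = default_sample * BLCKSZ   # ~24 MB
max_worst = max_sample * BLCKSZ           # ~2.4 GB
```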

> There is a further effect of concern here. We currently apply a "lift"
> to the D estimate when we think that the number of distinct values is
> high enough to imply that it should be scaled according to N. In the
> above example, when sample ratio is too small, D will hit the point
> where it is too low to be scaled and we suddenly "bomb out" to a much
> lower value.

Yes, this is another problem with the current algorithm.  This kind of 
"thresholding" is, well, hackish.  More importantly, it leads to 
unpredictable query behavior, as a query on a table only 10 rows larger 
yields a radically different plan at the edge of the threshold.
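
A sketch of that cliff, assuming a 10% cutoff modeled on ANALYZE's 
convention of storing scaling estimates as a negative fraction of the row 
count (the exact cutoff value is my assumption):

```python
def stored_ndistinct(d_estimate, total_rows, threshold=0.10):
    """Sketch of the thresholding described above: estimates over the
    cutoff are stored as a negative fraction of N, so they scale as the
    table grows; estimates under it are frozen as absolute counts."""
    if d_estimate > threshold * total_rows:
        # "high enough to scale with N": store as a fraction of N
        return -(d_estimate / total_rows)
    # otherwise a fixed absolute count, however large N grows
    return d_estimate

# Two nearly identical estimates straddling the cutoff are stored in
# radically different forms -- the plan instability complained about:
above = stored_ndistinct(100_100, 1_000_000)   # -> -0.1001 (scales with N)
below = stored_ndistinct( 99_900, 1_000_000)   # -> 99900   (frozen)
```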

> The test results don't seem too bad if you view the estimate of D as at
> most a factor of 10 wrong. However, since the error scales up with the
> size of the table, we can imagine very large estimation errors.

Yes.  My tests showed that for a 100 GB TPC-H database, with 600 million 
rows in Lineitem, D was on average 30x too low and could not be less than 
10x too low even with the luckiest sample.  This misestimate gets worse as 
the table gets larger.

> Chaudhuri's estimator is based on a least risk approach, rather than a
> greatest accuracy approach, which does sound appealing should we not be
> able to apply an improved estimator.

As I point out above, though, Chaudhuri's understanding of "least risk" 
is flawed.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
