Re: Improving N-Distinct estimation by ANALYZE

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 00:07:37
Message-ID: 200601041607.37590.josh@agliodbs.com
Lists: pgsql-hackers

Simon,

> - Are there any performance issues that can be directly attributed to
> mis-estimation of N-Distinct ("D") by the ANALYZE command?

Yes. There's at least one query (maybe two) from TPC-H which bombs
because of bad N-distinct estimation, even with stats_target = 1000. Based
on my experience with data warehouses, this also occurs in the field.

> - If so, can we do better than we currently achieve? How?

Replace the current algorithm and broaden the sample.

> - Would altering the estimate of D cause problems in other places?

Unlike Index Cost Estimation, I wouldn't expect it to. We make pretty
"proper" use of D right now; it's just that for some common cases our
estimates of D are bad.

> The estimation of D is difficult and imprecise. The current method works
> well in many cases, yet breaks down *badly* in one particular very
> common use case of database design: a large dependent table with a
> multi-column Primary Key. In some cases the estimate of D *decreases* as
> the size of the table increases, though the estimate of D is an
> underestimate in almost all cases, whatever the table design.

Actually, the current estimator underestimates D for *any* large table's
high-cardinality columns, whether they're a primary key, multi-column, or
neither. Chaudhuri's calculation seems designed to yield the smallest
number of distinct values that could reasonably be expected to have
produced the observed sample. That is, if the estimate range within two
standard deviations runs from 50,000 to 500,000 distinct values, Chaudhuri
picks 50,000.
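
To make that concrete, here is a quick back-of-the-envelope simulation.
The inputs are all my own assumptions for illustration (10M rows, 1M
distinct values with mild skew, a 3,000-row sample), and the formula is
the Duj1 smoothed-jackknife from the Haas/Stokes paper cited below, as I
recall it, so treat it as a sketch rather than the exact analyze.c code:

    import random
    from collections import Counter

    random.seed(1)
    N = 10_000_000            # rows in the hypothetical table
    n = 3_000                 # rows the sample actually examines
    HOT, RARE = 100, 999_900  # 100 hot values cover ~30% of rows
    true_D = HOT + RARE       # 1,000,000 distinct values overall

    def draw():
        # Mildly skewed column: 30% of rows hold one of the hot values.
        if random.random() < 0.30:
            return random.randrange(HOT)
        return HOT + random.randrange(RARE)

    counts = Counter(draw() for _ in range(n))
    d = len(counts)                                   # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)    # sampled exactly once

    # Haas/Stokes "Duj1" estimator: n*d / (n - f1 + f1*n/N)
    d_hat = (n * d) / (n - f1 + f1 * n / N)
    print(f"sample distinct={d}  singletons={f1}")
    print(f"estimated D={d_hat:,.0f}  true D={true_D:,}")

The duplicates contributed by the hot values drag f1 well below d, and
the estimate lands in the thousands against a true D of a million, i.e.
roughly two orders of magnitude low -- the same flavor of error as the
TPC-H cases above.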

This conservative approach makes sense when you're only considering join
strategies. Given an unreliable estimate, you want to err on the low side
for D so that you don't wrongly choose a nested loop: the cost of that
mistake is much higher than the cost of performing an unnecessary hash
join. It's "conservative" in that sense.

However, PostgreSQL now has a whole set of hash operations and other
query types for which a too-low estimate of D causes query lockup. For
these operations, Chaudhuri ceases to be conservative and becomes
high-risk. FWIW, my testing with TPC-H showed that estimation error is
usually OK within +/- 5x; beyond that you start to get bad query plans.

(Yes, I know all of the above begs for examples; I'm swamped. I believe I
posted examples when I first started talking about n-distinct estimation a
year ago.)

So I think it's vital that we look at algorithms designed to deliver the
median estimate of D, not the lowest reasonable one, in addition to
increasing the sample size. The block-based estimator functions which
Andrew and I looked at seem designed to do that, provided a sample of
between 1% and 10%.

> 1. *All* methods of statistical analysis are improved by larger sample
> fractions. The D estimator method currently in use shows an optimum of
> accuracy and sample fraction at around 5% of a table, as shown in the
> author's original paper [Haas Stokes (1998)]. The current
> implementation's error rates climb higher as table size increases.

I read five different papers on ACM about sampling-based estimation of D.
All of them agreed that you can't get even slightly accurate estimates
with less than a 3% sample.

> 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> row sampling technique rather than a block sampling technique. This
> would translate directly into a performance drop from large sample
> ratios, but since we currently use a fixed sample size this problem is
> not yet visible for larger tables. With a 2GB table, we would typically
> sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.

This would be a reason to use block sampling ONLY, rather than hybrid
sampling.
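
For what it's worth, the arithmetic behind those ratios looks roughly
like this. The rows-per-block figure and the default statistics target
are my own assumptions, and the 300x multiplier is from memory, so this
is a sketch of the proportions rather than the real ANALYZE code:

    BLCKSZ = 8 * 1024
    table_bytes = 2 * 1024**3            # 2 GB table
    blocks = table_bytes // BLCKSZ       # 262,144 blocks
    rows_per_block = 50                  # assumed average row density
    rows = blocks * rows_per_block       # ~13.1M rows

    sample_rows = 300 * 10               # ~300 x default stats target of 10
    print(f"row fraction sampled:   {sample_rows / rows:.4%}")
    print(f"block fraction touched: {min(sample_rows, blocks) / blocks:.2%}")

Because almost every sampled row lives in a different block, we pay for
reading roughly 1% of the blocks while looking at only a few hundredths
of a percent of the rows; block sampling would at least use every row in
the blocks we already paid to read.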

> 3. Large values of statistics target (up to 1000) could cause a number
> of problems with statistics catalog table growth (mentioned on
> -perform). Setting these values appropriately can take significant
> effort. Automatic scaling of such parameters is desirable.

Well, decoupling the n-distinct sample size from the number of histogram
rows would fix that.
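
Concretely, the coupling looks something like this; the 300x sample
multiplier and the per-column entry counts are from memory, so take it
as a sketch of the proportions, not the real catalog math:

    # One knob, two unrelated consequences: how many rows ANALYZE reads,
    # and how much per-column statistics data ends up in pg_statistic.
    for target in (10, 100, 1000):
        sample_rows = 300 * target    # rows read to build the sample
        mcv_and_hist = 2 * target     # ~target MCVs plus ~target histogram bounds
        print(f"stats target {target:>4}: sample {sample_rows:>7,} rows, "
              f"store up to ~{mcv_and_hist:,} entries per column")

A bigger n-distinct sample shouldn't have to drag a 1000-entry histogram
and MCV list into the catalogs along with it.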

> 4. ANALYZE doesn't use more memory if maintenance_work_mem is set high,
> nor does it use less if it is set low. Actual memory usage isn't
> measured at all. With very long rows this is max 24 MB with default
> stats target settings and BLCKSZ, or 2.4 GB with highest stats target
> (1000). This probably is of lower importance since stats targets are
> only usually set higher on larger databases, which typically have larger
> memory configurations anyway - and very long rows are uncommon because
> of TOAST. Typical memory usage by ANALYZE would be < 1 MB with default
> settings i.e. maybe as low as a 0.01% sample for a very large table.

Yeah, I think I pointed out a year ago that ANALYZE doesn't seem to be
using maintenance_work_mem. It should, although there should be an option
to use less than maintenance_work_mem if the user wants a low-impact
ANALYZE.

> There is a further effect of concern here. We currently apply a "lift"
> to the D estimate when we think that the number of distinct values is
> high enough to imply that it should be scaled according to N. In the
> above example, when sample ratio is too small, D will hit the point
> where it is too low to be scaled and we suddenly "bomb out" to a much
> lower value.

Yes, this is another problem with the current algorithm. This kind of
"thresholding" is, well, hackish. More importantly, it leads to
unpredictable query behavior: a query on a table only 10 rows larger
yields a radically different plan at the edge of the threshold.
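
For anyone who hasn't read that part of analyze.c, the threshold
behavior is roughly the following. The 10% cutoff and the
negative-stadistinct convention ("negative means a fraction of the row
count") are from memory, so this is a sketch, not the actual code:

    def stadistinct(d_estimate, totalrows):
        # Above the cutoff, assume D scales with the table and store a
        # fraction; below it, store a fixed absolute count.
        if d_estimate > 0.1 * totalrows:
            return -(d_estimate / totalrows)
        return d_estimate

    rows = 1_000_000
    for d_est in (100_001, 99_999):      # straddle the 10% threshold
        s = stadistinct(d_est, rows)
        # What the planner would assume after the table doubles in size:
        assumed = -s * 2 * rows if s < 0 else s
        print(f"estimate {d_est:,}: stored {s} -> "
              f"planner assumes {assumed:,.0f} distinct at 2M rows")

Two estimates that differ by a couple of values straddle the cutoff, and
once the table doubles the planner's assumed D differs by 2x between
them. That discontinuity is exactly where the "radically different plan"
behavior comes from.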

> The test results don't seem too bad if you view the estimate of D as at
> most a factor of 10 wrong. However, since the error scales up with the
> size of the table, we can imagine very large estimation errors.

Yes. My tests showed that for a 100GB TPC-H database, with 600 million
rows in Lineitem, D was on average 30x too low and could not be less than
10x too low even with the luckiest sample. This misestimate gets worse as
the table gets larger.

> Chaudhuri's estimator is based on a least risk approach, rather than a
> greatest accuracy approach, which does sound appealing should we not be
> able to apply an improved estimator.

As I point out above, though, Chaudhuri's understanding of "least risk"
is flawed.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
