hist boundary duplicates bug in head and 8.3

From: "Nathan Boley" <npboley(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: hist boundary duplicates bug in head and 8.3
Date: 2009-01-06 01:15:46
Message-ID: 6fa3b6e20901051715p2a6b03dbt30ce14e9e2bc796c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For heavy tailed distributions, it is possible for analyze to
duplicate histogram boundaries.

Here is the output from a test against HEAD; I've attached the test data.

=# create table bug(f float);
COPY 100000
=# copy bug from '/tmp/test_data.txt';
COPY 100000
=# analyze bug;
ANALYZE
=# select histogram_bounds from pg_stats where tablename='bug';

histogram_bounds

------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------
--------

{34,34,34,34,34,34,34,36,36,36,36,36,37,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,41,41,41,41,42,42,42,43,43,44,44,44,45,45,45,46,46,46,47,47,48,48,48,4
9,49,50,50,51,51,52,53,53,54,55,55,56,57,58,58,59,60,61,62,63,64,65,67,68,70,72,74,76,79,81,83,86,89,91,94,98,101,105,110,117,122,131,139,152,171,202,236,30
8,1457}
(1 row)

Analyze assumes that if a value has a sample count less than
samplesize/num_buckets, ( maxmincount near line 2170 in
commands/analyze.c) then it is safe to include it in a histogram.
However, because the histogram only contains non mcv's, this is
incorrect.

As far as I can see, there are 4 solutions:

1) track all of the distinct values
This wouldn't be *too* expensive in analyze, especially considering we
are tracking all of the sampled values as it is. However, this opens
up the possibility of having huge mcv's lists in the worst case.To see
this, consider a distribution such that the most common value was 20%
of the table, the next mcv was 20% of the remaining entries, etc.
Clearly, for stats targets greater than 5, every value in the table
would overrun a histogram boundary, leading to an mcv list that
contained every distinct value in the sample.

2) reduce number_of_bins if values exist with frequency greater than 1/nbins
This would fix the bug, but at the cost of reducing the utility of the
histogram ( it would introduce a large skew to the ndistinct
distribution, which is assumed to be uniform over non-mcvs ).

3) use variable width histogram bins over all values.
This is probably the cleanest solution, but the most invasive.

4) Fix the binary search in ineqsel to correctly find the boundaries,
even with duplicates
This would also be relatively clean, but are the hist boundaries
assumption of being strictly increasing being satisfied anywhere else
besides ineqsel?

I've attached a patch that is a compromise between 1 and 2. It puts a
hard limit on the number of mcv's at 2x the stats target, and then, if
there are still values with too high a frequency, it reduces the
number of histogram buckets.

-Nathan

Attachment Content-Type Size
test_data.txt.gz application/x-gzip 91.1 KB
hist_bndry_bug_fix.patch text/x-patch 3.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2009-01-06 01:22:14 Re: Time to finalize patches for 8.4 beta
Previous Message Josh Berkus 2009-01-06 01:12:27 Re: Time to finalize patches for 8.4 beta