## Re: Problem with n_distinct being consistently inaccurate.

From: "Nick Fankhauser" "Tom Lane" "pgsql-admin" Re: Problem with n_distinct being consistently inaccurate. 2003-09-23 20:03:57 NEBBLAAHGLEEPCGOBHDGOEIOIJAA.nickf@ontko.com (view raw or flat) 2003-09-23 16:38:47 from "Nick Fankhauser"  2003-09-23 17:37:16 from Tom Lane   2003-09-23 20:03:57 from "Nick Fankhauser"    2003-09-23 20:07:33 from Tom Lane     2003-09-23 20:26:18 from "Nick Fankhauser"      2003-09-23 20:34:04 from Tom Lane       2003-09-23 21:05:58 from "Nick Fankhauser"        2003-09-23 23:43:50 from Tom Lane         2003-09-24 15:07:51 from "Nick Fankhauser" pgsql-admin
```> AFAIK, estimating number of distinct values from a small sample is
> inherently an ill-conditioned problem.

If I had been getting estimates all over the map, I'd have been a bit more
unconcerned, but what I'm seeing is a very consistent number that also
increases and tends to be more consistent in proportion to the stats
"target" number. This makes me think that there is more at work here than
the inaccuracy likely to occur from small samples. It's as if the algorithm
and sample size (even at default) are pretty reasonable for returning
consistent results in this case, but a multiplier needs to be changed.

For instance, with the various values for statistics, if I do an analyze on
the table and then look at n_distinct 6 times, these are the results I get:

(actual number is 92,000)

set statistics = -1 (default):
13549
14268
14772
14518
13863
13526

mean = 14083
std dev = 518 or 3.7% of mean

set statistics = 100
22457
22598
22566
22580
22767
22490

mean = 22576
std dev = 108 or .5% of mean

set statistics = 500
39878
39984
40018
39977
39885
40070

mean = 39968
std dev = 75 or .2% of mean

set statistics = 1000
51428
51503
51486
51658
51625
51589

mean = 51548
std dev = 74 or .1% of mean

> You could try sticking the correct n_distinct into pg_statistic by hand
> just to see if it really does change the plan

OK... but I'm a bit confused on how to get to the right row in pg_statistic.

when I do a \d on pg_stats, it appears that pg_statistic.starelid matches up
with pg_class.oid, but apparently this is not the case. Is there a place I
can look to find which keys correspond among the pg_catalog tables?

> but I'd like to think
> that getting within a factor of 2 is good enough.

Probably so... but with the default stats, it is more like a factor of 6,
which seems significant to me, and if my conjecture is correct, it might be
an easy fix. (Easy for me to say, since I'm not a developer. <grin>)

-Nick

```

### pgsql-admin by date

 Next: From: Tom Lane Date: 2003-09-23 20:07:33 Subject: Re: Problem with n_distinct being consistently inaccurate. Previous: From: scott.marlowe Date: 2003-09-23 19:58:32 Subject: Re: help needed!!!

Privacy Policy | About PostgreSQL
Copyright © 1996-2013 The PostgreSQL Global Development Group