Re: number of rows in analyze

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Willy-Bas Loos <willybas(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: number of rows in analyze
Date: 2011-08-01 14:49:59
Message-ID: 23867.1312210199@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-general

Willy-Bas Loos <willybas(at)gmail(dot)com> writes:
> "The largest statistics target among the columns being analyzed
> determines the number of table rows sampled to prepare the
> statistics."
> (http://www.postgresql.org/docs/8.4/interactive/sql-analyze.html)

> My question is *HOW* does it "detirmine the number of table rows
> sampled" with the statistics target?
> What is the formula behind it?

>From src/backend/commands/analyze.c:

/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;

> I am not familiar enough with the source code to figure that out
> myself. sry about that.
> Would be good to include in the docs if we find out.

Why? This is purely an implementation detail. From the user's
viewpoint, either the stats are good enough or they're not --- the exact
number of rows sampled doesn't seem that interesting.

regards, tom lane

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Willy-Bas Loos 2011-08-01 16:02:34 Re: number of rows in analyze
Previous Message ktm@rice.edu 2011-08-01 14:49:12 Re: number of rows in analyze

Browse pgsql-general by date

  From Date Subject
Next Message Jaime Casanova 2011-08-01 15:05:49 Re: repmgr problem with registering standby
Previous Message ktm@rice.edu 2011-08-01 14:49:12 Re: number of rows in analyze