--- perform.sgml.orig Sat Feb 5 12:45:36 2005 +++ perform.sgml Tue Feb 8 15:08:02 2005 @@ -470,6 +470,288 @@ + + + How the Planner Uses Statistics + + + statistics + of the planner + + + + This section builds on the material covered in the previous two and + shows how the planner uses the system statistics to estimate the number of + rows each stage of a query might return. We will adopt the approach of + showing by example, which should provide a good feel for how this works. + + + + Continuing with the examples drawn from the regression test + database (and 8.0 sources), let's start with a simple query which has + one restriction in its WHERE clause : + + +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; + + QUERY PLAN +------------------------------------------------------------ + Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244) + Filter: (unique1 < 1000) + + + + The planner examines the WHERE clause condition: + + +unique1 < 1000 + + + and looks up the restriction function for the operator + < in pg_operator. + This is held in the column oprrest, + and the result in this case is scalarltsel. + The scalarltsel function retrieves the histogram for + unique1 from pg_statistics + - we can follow this by using the simpler pg_stats + view: + + +SELECT histogram_bounds FROM pg_stats +WHERE tablename='tenk1' AND attname='unique1'; + + histogram_bounds +------------------------------------------------------ + {1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995} + + + Next the fraction of the histogram occupied by < 1000 + is worked out. This is the selectivity. The histogram divides the range + into equal frequency buckets, so all we have to do is locate the bucket + that our value is in and count part of it and + all of the ones before. The value 1000 is clearly in + the second (970 - 1943) bucket, so by assuming a linear distribution of + values inside each bucket we can calculate the selectivity as: + + +selectivity = (1 + (1000 - 970)/(1943 - 970)) / 10 + = 0.1031 + + + that is, one whole bucket plus a linear fraction of the second, divided by + the number of buckets. The estimated number of rows can now be calculated as + the product of the selectivity and the cardinality of + tenk1 : + + +rows = 10000 * 0.1031 + = 1031 + + + + + + Next let's consider an example with a WHERE clause using + the = operator : + + +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA'; + + QUERY PLAN +---------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244) + Filter: (stringu1 = 'ATAAAA'::name) + + + Again the planner examines the WHERE clause condition: + + +stringu1 = 'ATAAAA' + + + and looks up the restriction function for =, which is + eqsel. This case is a bit different, as the most + common values - MCVs, are used to determine the + selectivity. Let's have a look at these, with some extra columns that will + be useful later : + + +SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats +WHERE tablename='tenk1' AND attname='stringu1'; + + null_frac | n_distinct | + most_common_vals | + most_common_freqs +-------------------------------------------------------------------------+----- + 0 | 672 | + {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA} | + {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667} + + + The selectivity is merely the frequency corresponding to 'ATAAAA': + + +selectivity = 0.003 + + + The estimated number of rows is just the product of this with the + cardinality of tenk1 as before : + + +rows = 10000 * 0.003 + = 30 + + + The number displayed by EXPLAIN is one more than this, + due to some post estimation checks. + + + + Now consider the same query, but with a constant that is not in the + MCV list : + + +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; + + QUERY PLAN +---------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244) + Filter: (stringu1 = 'xxx'::name) + + + This is quite a different problem, how to estimate the selectivity when the + value is not in the MCV list. The approach is to use + the fact that the value is not in the list, combined with the knowledge + of the frequencies for all of the MCVs : + + +selectivity = (1.0 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003 + + 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667)) / (672 - 10) + = 0.001465 + + + That is, add up all the frequencies for the MCVs and subtract them from + one - because it is not one of these, and divide by + the remaining distinct values. Notice that there are no + null values so we don't have to worry about those. The estimated number of + rows is calculated as usual : + + +rows = 10000 * 0.001465 + = 15 + + + + + + In the case where there is more than one condition in the + WHERE clause, for example : + + +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx'; + + QUERY PLAN +----------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244) + Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name)) + + + then independence is assumed and the selectivities of the individual + restrictions are multiplied together : + + +selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx') + = 0.1031 * 0.001465 + = 0.00015104 + + + The row estimates are calculated as before : + + +rows = 10000 * 0.00015104 + = 2 + + + + + Let's examine a query that includes a JOIN : + + +EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 +WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2; + + QUERY PLAN +----------------------------------------------------------------------------------------- + Nested Loop (cost=0.00..346.90 rows=51 width=488) + -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244) + Index Cond: (unique1 < 50) + -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244) + Index Cond: ("outer".unique2 = t2.unique2) + + + The restriction on tenk1 + unique1 < 50 is evaluated before the nested-loop join. + This is handled analogously to the initial example. The restriction operator + for < is scalarlteqsel as before, + but this time the value 50 is in the first bucket of the + unique1 histogram : + + +selectivity = ((50 - 1) / (970 - 1)) / 10 + = 0.005057 + +rows = 10000 * 0.005057 + = 51 + + + The restriction for the join is : + + +t2.unique2 = t1.unique2 + + + This is due to the join method being nested-loop, with + tenk1 being in the outer loop. The operator is just + our familiar =, however the restriction function is + obtained from the oprjoin column of + pg_operator - and is eqjoinsel. + Additionally we use the statistical information for both + tenk2 and tenk1 : + + +SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats +WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2'; + +tablename | null_frac | n_distinct | most_common_vals +-----------+-----------+------------+------------------ + tenk1 | 0 | -1 | + tenk2 | 0 | -1 | + + + In this case there is no MCV information for + unique2 because all the values appear to be unique, + so we can use an algorithm that relies only on the number of distinct values + for both relations together with their null fractions : + + +selectivity = (1 - 0) * (1 - 0) * min(1 / 10000, 1 / 1000) + = 0.0001 + + + This is, subtract the null fraction from one for each of the relations, + and divide by the maximum of the two distinct values. The number of rows + that the join is likely to emit is calculated as the cardinality of + cartesian product of the two nodes in the nested-loop, multiplied by the + selectivity : + + +rows = 51 * 10000 * 0.0001 + = 51 + + + + + Controlling the Planner with Explicit <literal>JOIN</> Clauses