23rd May 2019: PostgreSQL 12 Beta 1 Released!

As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. This section provides a quick look at the statistics that the system uses for these estimates.

One component of the statistics is the total number of entries
in each table and index, as well as the number of disk blocks
occupied by each table and index. This information is kept in
`pg_class`'s `reltuples` and `relpages` columns. We can look at it with
queries similar to this one:

regression=# SELECT relname, relkind, reltuples, relpages FROM pg_class regression-# WHERE relname LIKE 'tenk1%'; relname | relkind | reltuples | relpages ---------------+---------+-----------+---------- tenk1 | r | 10000 | 233 tenk1_hundred | i | 10000 | 30 tenk1_unique1 | i | 10000 | 30 tenk1_unique2 | i | 10000 | 30 (4 rows)

Here we can see that `tenk1`
contains 10000 rows, as do its indexes, but the indexes are
(unsurprisingly) much smaller than the table.

For efficiency reasons, `reltuples`
and `relpages` are not updated
on-the-fly, and so they usually contain only approximate values
(which is good enough for the planner's purposes). They are
initialized with dummy values (presently 1000 and 10
respectively) when a table is created. They are updated by
certain commands, presently `VACUUM`,
`ANALYZE`, and `CREATE
INDEX`. A stand-alone `ANALYZE`, that
is one not part of `VACUUM`, generates an
approximate `reltuples` value since it
does not read every row of the table.

Most queries retrieve only a fraction of the rows in a table,
due to having `WHERE` clauses that
restrict the rows to be examined. The planner thus needs to make
an estimate of the *selectivity* of
`WHERE` clauses, that is, the fraction of
rows that match each clause of the `WHERE`
condition. The information used for this task is stored in the
`pg_statistic` system catalog. Entries
in `pg_statistic` are updated by
`ANALYZE` and `VACUUM
ANALYZE` commands, and are always approximate even when
freshly updated.

Rather than look at `pg_statistic`
directly, it's better to look at its view `pg_stats` when examining the statistics
manually. `pg_stats` is designed to be
more easily readable. Furthermore, `pg_stats` is readable by all, whereas `pg_statistic` is only readable by the superuser.
(This prevents unprivileged users from learning something about
the contents of other people's tables from the statistics. The
`pg_stats` view is restricted to show
only rows about tables that the current user can read.) For
example, we might do:

regression=# SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'road'; attname | n_distinct | most_common_vals ---------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- name | -0.467008 | {"I- 580 Ramp","I- 880 Ramp","Sp Railroad ","I- 580 ","I- 680 Ramp","I- 80 Ramp","14th St ","5th St ","Mission Blvd","I- 880 "} thepath | 20 | {"[(-122.089,37.71),(-122.0886,37.711)]"} (2 rows) regression=#

Table
10-1 shows the columns that exist in `pg_stats`.

Table 10-1. `pg_stats`
Columns

Name | Type | Description |
---|---|---|

tablename |
name |
Name of the table containing the column |

attname |
name |
Column described by this row |

null_frac |
real |
Fraction of column's entries that are null |

avg_width |
integer |
Average width in bytes of the column's entries |

n_distinct |
real |
If greater than zero, the estimated number of
distinct values in the column. If less than zero, the
negative of the number of distinct values divided by the
number of rows. (The negated form is used when ANALYZE believes that the number of
distinct values is likely to increase as the table grows;
the positive form is used when the column seems to have a
fixed number of possible values.) For example, -1
indicates a unique column in which the number of distinct
values is the same as the number of rows. |

most_common_vals |
text[] |
A list of the most common values in the column. (Omitted if no values seem to be more common than any others.) |

most_common_freqs |
real[] |
A list of the frequencies of the most common values, i.e., number of occurrences of each divided by total number of rows. |

histogram_bounds | text[] |
A list of values that divide the column's values into
groups of approximately equal population. The most_common_vals, if present, are
omitted from the histogram calculation. (Omitted if
column data type does not have a < operator, or if the most_common_vals list accounts for the
entire population.) |

correlation | real |
Statistical correlation between physical row ordering
and logical ordering of the column values. This ranges
from -1 to +1. When the value is near -1 or +1, an index
scan on the column will be estimated to be cheaper than
when it is near zero, due to reduction of random access
to the disk. (Omitted if column data type does not have a
< operator.) |

The maximum number of entries in the `most_common_vals` and `histogram_bounds` arrays can be set on a
column-by-column basis using the `ALTER TABLE
SET STATISTICS` command. The default limit is presently 10
entries. Raising the limit may allow more accurate planner
estimates to be made, particularly for columns with irregular
data distributions, at the price of consuming more space in
`pg_statistic` and slightly more time
to compute the estimates. Conversely, a lower limit may be
appropriate for columns with simple data distributions.