Understanding histograms

From: Len Shapiro <len(at)cs(dot)pdx(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Understanding histograms
Date: 2008-04-30 04:56:32
Message-ID: 4817FC00.50609@cs.pdx.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I hope I am posting to the right list.
I am running Postgresql 8.1.9 and don't understand the behavior of
histograms for data items not in the MVC list. I teach databases and
want to use Postgres as an example. I will appreciate any help that
anyone can provide.

Here is the data I am using. I am interested only in the "rank" attribute.

CREATE TABLE Sailors (
sid Integer NOT NULL,
sname varchar(20),
rank integer,
age real,
PRIMARY KEY (sid));

I insert 30 sailor rows:

INSERT INTO Sailors VALUES (3, 'Andrew', 10, 30.0);
INSERT INTO Sailors VALUES (17, 'Bart', 5, 30.2);
INSERT INTO Sailors VALUES (29, 'Beth', 3, 30.4);
INSERT INTO Sailors VALUES (28, 'Bryant', 3, 30.6);
INSERT INTO Sailors VALUES (4, 'Cynthia', 9, 30.8);
INSERT INTO Sailors VALUES (16, 'David', 9, 30.9);
INSERT INTO Sailors VALUES (27, 'Fei', 3, 31.0);
INSERT INTO Sailors VALUES (12, 'James', 3, 32.0);
INSERT INTO Sailors VALUES (30, 'Janice', 3, 33.0);
INSERT INTO Sailors VALUES (2, 'Jim', 8, 34.5);
INSERT INTO Sailors VALUES (15, 'Jingke', 10, 35.0);
INSERT INTO Sailors VALUES (26, 'Jonathan',9, 36.0);
INSERT INTO Sailors VALUES (24, 'Kal', 3, 36.6);
INSERT INTO Sailors VALUES (14, 'Karen', 8, 37.8);
INSERT INTO Sailors VALUES (8, 'Karla',7, 39.0);
INSERT INTO Sailors VALUES (25, 'Kristen', 10, 39.5);
INSERT INTO Sailors VALUES (19, 'Len', 8, 40.0);
INSERT INTO Sailors VALUES (7, 'Lois', 8, 41.0);
INSERT INTO Sailors VALUES (13, 'Mark', 7, 43.0);
INSERT INTO Sailors VALUES (18, 'Melanie', 1, 44.0);
INSERT INTO Sailors VALUES (5, 'Niru', 5, 46.0);
INSERT INTO Sailors VALUES (23, 'Pavel', 3, 48.0);
INSERT INTO Sailors VALUES (1, 'Sergio', 7, 50.0);
INSERT INTO Sailors VALUES (6, 'Suhui', 1, 51.0);
INSERT INTO Sailors VALUES (22, 'Suresh',9, 52.0);
INSERT INTO Sailors VALUES (20, 'Tim',7, 54.0);
INSERT INTO Sailors VALUES (21, 'Tom', 10, 56.0);
INSERT INTO Sailors VALUES (11, 'Warren', 3, 58.0);
INSERT INTO Sailors VALUES (10, 'WuChang',9, 59.0);
INSERT INTO Sailors VALUES (9, 'WuChi', 10, 60.0);

after analyzing, I access the pg_stats table with

SELECT n_distinct, most_common_vals,
most_common_freqs, histogram_bounds
FROM pg_stats WHERE tablename = 'sailors' AND attname = 'rank';

and I get:

n_distinct most_common_vals most_common_freqs
histogram_bounds
-0.233333
{3,9,10,7,8} {0.266667,0.166667,0.166667,0.133333,0.133333}
{1,5}

I have two questions. I'd appreciate any info you can provide,
including pointers to the source code.

1. Why does Postgres come up with a negative n_distinct? It
apparently thinks that the number of rank values will increase as the
number of sailors increases. What/where is the algorithm that decides
that?

2. The most_common_vals and their frequencies make sense. They say
that the values {3,9,10,7,8} occur a total of 26 times, so other
values occur a total of 4 times. The other, less common, values are 1
and 5, each occuring twice, so the histogram {1,5} is appropriate.
If I run the query
EXPLAIN SELECT * from sailors where rank = const;
for any const not in the MVC list, I get the plan

Seq Scan on sailors (cost=0.00..1.38 rows=2 width=21)
Filter: (rank = const)

The "rows=2" estimate makes sense when const = 1 or 5, but it makes no
sense to me for other values of const not in the MVC list.
For example, if I run the query
EXPLAIN SELECT * from sailors where rank = -1000;
Postgres still gives an estimate of "row=2".
Can someone please explain?

Thanks,

Len Shapiro
Portland State University

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2008-04-30 05:19:51 Re: Understanding histograms
Previous Message Merlin Moncure 2008-04-29 23:27:20 Re: SSD database benchmarks