Re: Thoughts on statistics for continuously advancing columns

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on statistics for continuously advancing columns
Date: 2009-12-30 19:20:51
Message-ID: 115e229bf38b5269dccf666f4b99da41@commandprompt.com
Lists: pgsql-hackers

On Wed, 30 Dec 2009 18:42:38 +0000, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> I'm a bit puzzled by people's repeated suggestion here that large
> tables take a long time to analyze. The sample analyze takes to
> generate statistics is not heavily influenced by the size of the
> table. Your 1TB table should take basically the same amount of time as
> a 1GB table or a 1MB table (if it wasn't already in cache).

No.

postgres=# analyze verbose test_one_million;
INFO: analyzing "public.test_one_million"
INFO: "test_one_million": scanned 3000 of 4425 pages, containing 677950
live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 168.009 ms
postgres=# analyze verbose test_one_million;
INFO: analyzing "public.test_one_million"
INFO: "test_one_million": scanned 3000 of 4425 pages, containing 677950
live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 104.006 ms
postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000
live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total
rows
ANALYZE
Time: 20145.148 ms
postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000
live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total
rows
ANALYZE
Time: 18481.053 ms
postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000
live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total
rows
ANALYZE
Time: 17653.006 ms

The test_one_million table is very quick whether it is in cache or not. I
don't think test_ten_million actually fits in cache (it's a small box), but
either way, compare the on-disk number for the one-million-row table
(168 ms) against the on-disk numbers for the ten-million-row table: they
are vastly different.
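
If I recall correctly, the 3000-row sample in both runs comes from the
statistics target (ANALYZE samples roughly 300 rows per unit of target, and
the default target here is 10). Something along these lines would show the
setting and raise it for one column; the column name is just a placeholder:

postgres=# SHOW default_statistics_target;
postgres=# -- column name below is hypothetical, purely for illustration
postgres=# ALTER TABLE test_ten_million ALTER COLUMN some_col SET STATISTICS 100;
postgres=# ANALYZE VERBOSE test_ten_million;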

postgres=# select
pg_size_pretty(pg_total_relation_size('test_one_million'));
pg_size_pretty
----------------
35 MB
(1 row)

Time: 108.006 ms
postgres=# select
pg_size_pretty(pg_total_relation_size('test_ten_million'));
pg_size_pretty
----------------
346 MB
(1 row)
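
For anyone who wants to poke at this themselves, tables of roughly this
shape can be generated along these lines (just a sketch, a single integer
column; the real test tables may differ, but the sizes come out in the same
ballpark as the 35 MB / 346 MB above):

postgres=# CREATE TABLE test_one_million AS
               SELECT g AS id FROM generate_series(1, 1000000) AS g;
postgres=# CREATE TABLE test_ten_million AS
               SELECT g AS id FROM generate_series(1, 10000000) AS g;
postgres=# VACUUM ANALYZE test_one_million;
postgres=# VACUUM ANALYZE test_ten_million;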

>
> Unless the reason why it's 1TB is that the columns are extremely wide
> rather than that it has a lot of rows?

I should have qualified that: yes, the columns are very wide.
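
For what it's worth, the per-column average widths from the last ANALYZE
give a quick way to see just how wide; the table name below is a
placeholder for the real one:

postgres=# SELECT attname, avg_width
           FROM pg_stats
           WHERE tablename = 'the_wide_table'   -- placeholder name
           ORDER BY avg_width DESC;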

JD

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
