Re: Overhauling GUCS

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Andreas Pflug" <pgadmin(at)pse-consulting(dot)de>, "Decibel!" <decibel(at)decibel(dot)org>
Subject: Re: Overhauling GUCS
Date: 2008-06-09 20:35:27
Message-ID: 87d4mqwdm8.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Josh Berkus" <josh(at)agliodbs(dot)com> writes:

> Where analyze does systematically fall down is with databases over 500GB in
> size, but that's not a function of d_s_t but rather of our tiny sample size.

Speak to the statisticians. Our sample size is calculated using the same
theory behind polls which sample 600 people to learn what 250 million people
are going to do on election day. You do NOT need (significantly) larger
samples for larger populations.

In fact where those polls have difficulty is the same place we have some
problems. For *smaller* populations like individual congressional races you
need to have nearly the same 600 sample for each of those small races. That
adds up to a lot more than 600 total. In our case it means when queries cover
a range much less than a whole bucket then the confidence interval increases
too.

Also, our estimates for n_distinct are very unreliable. The math behind
sampling for statistics just doesn't work the same way for properties like
n_distinct. For that Josh is right, we *would* need a sample size proportional
to the whole data set which would practically require us to scan the whole
table (and have a technique for summarizing the results in a nearly constant
sized data structure).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2008-06-09 20:52:20 Re: pg_dump restore time and Foreign Keys
Previous Message Zdenek Kotala 2008-06-09 19:11:40 Re: handling TOAST tables in autovacuum