Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, Greg Stark <gsstark(at)mit(dot)edu>, Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-26 21:02:31
Message-ID: 1114549351.21529.363.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-performance

On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values. I doubt anyone is really
> >> going to consider this a feasible answer for large tables.
>
> > Assuming you don't use the HashAgg plan, which seems very appropriate
> > for the task? (...but I understand the plan otherwise).
>
> The context here is a case with a very large number of distinct
> values...

Yes, but is there any way of doing this other than sampling a larger
proportion of the table? I don't like that answer either, for the
reasons you give.

The manual doesn't actually say so, but you can already enlarge the
sample by setting one of the statistics targets higher; all of those
samples are of a fixed size, though, not a proportion of the table
itself. It seems reasonable to allow an option to scan a higher
proportion of the table. (It would be even better if you could say "keep
going until you run out of memory, then stop", to avoid having to add an
external sort mode to ANALYZE.)
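
For example, raising the target for a single column already enlarges
the (fixed-size) sample ANALYZE takes for it; the table and column
names here are just illustrative:

    -- A higher per-column statistics target makes ANALYZE take a
    -- larger (but still fixed-size) random sample for that column.
    ALTER TABLE bigtable ALTER COLUMN custid SET STATISTICS 1000;
    ANALYZE bigtable;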

Oracle and DB2 allow a proportion of the table to be specified as a
sample size during statistics collection. IBM seem to be ignoring their
own research note on estimating ndistinct...
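
For reference, the Oracle interface looks something like this (from
memory, so treat the exact parameters as approximate):

    -- Gather statistics from a ~10% sample of the table (Oracle)
    EXEC DBMS_STATS.GATHER_TABLE_STATS('SCOTT', 'BIGTABLE',
                                       estimate_percent => 10);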

> keep in mind also that we have to do this for *all* the
> columns of the table.

You can collect stats for individual columns, so you need only increase
the sample size for the columns that actually require it.
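
e.g. to redo the stats for just one column of a (made-up) table:

    -- ANALYZE takes an optional column list, so only the named
    -- columns' statistics are recomputed.
    ANALYZE bigtable (custid);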

Also, if you have a large table and the performance of ANALYZE worries
you, set the statistics target to 0 for some columns. Perhaps that
should be the default for very long text columns, since analyzing those
doesn't help much (usually) and takes ages. (I'm aware we already don't
analyze variable-length column values > 1024 bytes.)
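
A sketch, with an invented column name:

    -- A statistics target of zero makes ANALYZE skip the column
    -- entirely (-1 reverts it to the default target).
    ALTER TABLE bigtable ALTER COLUMN long_note SET STATISTICS 0;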

> A full-table scan for each column seems
> right out to me.

Some systems analyze multiple columns simultaneously.

Best Regards, Simon Riggs
