Skip site navigation (1) Skip section navigation (2)

Re: estimating # of distinct values

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2010-12-28 00:20:31
Message-ID: 4D192D4F.7090902@fuzzy.cz (view raw or flat)
Thread:
Lists: pgsql-hackers
Dne 28.12.2010 00:04, Kevin Grittner napsal(a):
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  
>> Well, first, those scans occur only once every few hundred million
>> transactions, which is not likely a suitable timescale for
>> maintaining statistics.
>  
> I was assuming that the pass of the entire table was priming for the
> incremental updates described at the start of this thread.  I'm not
> clear on how often the base needs to be updated for the incremental
> updates to keep the numbers "close enough".

Well, that really depends on the workload. If you never remove all
occurences of a given value (e.g. a ZIP code), you don't need to rescan
the table at all.

All you need is to build the stats once and then update them
incrementally - and I belive this could be handled by autovacuum.

>> And second, we keep on having discussions about rejiggering
>> the whole tuple-freezing strategy.  Even if piggybacking on those
>> scans looked useful, it'd be unwise to assume it'll continue to
>> work the same way it does now.
>  
> Sure, it might need to trigger its own scan in the face of heavy
> deletes anyway, since the original post points out that the
> algorithm handles inserts better than deletes, but as long as we
> currently have some sequential pass of the data, it seemed sane to
> piggyback on it when possible.  And maybe we should be considering
> things like this when we weigh the pros and cons of rejiggering. 
> This issue of correlated values comes up pretty often....

Again, there are two types of stats - one of them needs to scan the
whole table (estimate of distinct values), the other one does not
(multi-dimentional histograms).

These two cases are independent - you don't necessarily need both.

Better ndistinct estimates would actually solve some of the current
issues, it's not just a matter of cross-column stats.

regards
Tomas

In response to

pgsql-hackers by date

Next:From: Kevin GrittnerDate: 2010-12-28 01:09:41
Subject: Re: "writable CTEs"
Previous:From: Peter GeogheganDate: 2010-12-28 00:19:47
Subject: Re: "writable CTEs"

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group