Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Gurmeet Manku <manku(at)CS(dot)Stanford(dot)EDU>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <josh(at)agliodbs(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>, pgsql-perform <pgsql-performance(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, Utkarsh Srivastava <usriv(at)xenon(dot)Stanford(dot)EDU>, <snt(at)iastate(dot)edu>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-26 22:00:48
Message-ID: Pine.LNX.4.44.0504261443340.12051-100000@xenon.Stanford.EDU
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance


Hi everybody!

Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):

1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:

"Effective use of block-level sampling in statistics estimation"
by Chaudhuri, Das and Srivastava, SIGMOD 2004.

http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf

2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:

"Distinct sampling for highly-accurate answers to distinct value
queries and event reports" by Gibbons, VLDB 2001.

http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf

3. In fact, Gibbon's basic idea has been extended to "sliding windows"
(this extension is useful in streaming systems like Aurora / Stream):

"Distributed streams algorithms for sliding windows"
by Gibbons and Tirthapura, SPAA 2002.

http://home.eng.iastate.edu/~snt/research/tocs.pdf

Thanks,
Gurmeet

----------------------------------------------------
Gurmeet Singh Manku Google Inc.
http://www.cs.stanford.edu/~manku (650) 967 1890
----------------------------------------------------

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Wheeler 2005-04-26 22:06:31 Re: DO INSTEAD and conditional rules
Previous Message Tom Lane 2005-04-26 21:58:07 Re: DO INSTEAD and conditional rules

Browse pgsql-performance by date

  From Date Subject
Next Message Matthew Nuzum 2005-04-26 22:32:54 Re: speed up query with max() and odd estimates
Previous Message John A Meinel 2005-04-26 21:51:09 Re: speed up query with max() and odd estimates