Skip site navigation (1) Skip section navigation (2)

Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: Gurmeet Manku <manku(at)CS(dot)Stanford(dot)EDU>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <josh(at)agliodbs(dot)com>,Greg Stark <gsstark(at)mit(dot)edu>,Marko Ristola <marko(dot)ristola(at)kolumbus(dot)fi>,pgsql-perform <pgsql-performance(at)postgresql(dot)org>,<pgsql-hackers(at)postgresql(dot)org>,Utkarsh Srivastava <usriv(at)xenon(dot)Stanford(dot)EDU>, <snt(at)iastate(dot)edu>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-26 22:00:48
Message-ID: Pine.LNX.4.44.0504261443340.12051-100000@xenon.Stanford.EDU (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance
 Hi everybody!

 Perhaps the following papers are relevant to the discussion here
 (their contact authors have been cc'd):


 1. The following proposes effective algorithms for using block-level 
    sampling for n_distinct estimation:

 "Effective use of block-level sampling in statistics estimation"
 by Chaudhuri, Das and Srivastava, SIGMOD 2004.

 http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf


 2. In a single scan, it is possible to estimate n_distinct by using
    a very simple algorithm:

 "Distinct sampling for highly-accurate answers to distinct value
  queries and event reports" by Gibbons, VLDB 2001.

 http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf


 3. In fact, Gibbon's basic idea has been extended to "sliding windows" 
    (this extension is useful in streaming systems like Aurora / Stream):

 "Distributed streams algorithms for sliding windows"
 by Gibbons and Tirthapura, SPAA 2002.

 http://home.eng.iastate.edu/~snt/research/tocs.pdf


 Thanks,
 Gurmeet

 ----------------------------------------------------
 Gurmeet Singh Manku                      Google Inc.
 http://www.cs.stanford.edu/~manku    (650) 967 1890
 ----------------------------------------------------


In response to

Responses

pgsql-performance by date

Next:From: Matthew NuzumDate: 2005-04-26 22:32:54
Subject: Re: speed up query with max() and odd estimates
Previous:From: John A MeinelDate: 2005-04-26 21:51:09
Subject: Re: speed up query with max() and odd estimates

pgsql-hackers by date

Next:From: David WheelerDate: 2005-04-26 22:06:31
Subject: Re: DO INSTEAD and conditional rules
Previous:From: Tom LaneDate: 2005-04-26 21:58:07
Subject: Re: DO INSTEAD and conditional rules

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group