Re: estimating # of distinct values

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2010-12-28 12:06:53
Message-ID: AANLkTik+_9VvXg4Cyb2pzUSazmnjxQfePE3WBDwvnyLH@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> While I don't want to discourage you from working on stream-based
> estimators ... I'd love to see you implement a proof-of-concept for
> PostgreSQL, and test it ... the above is a non-argument.  It requires us
> to accept that sample-based estimates cannot ever be made to work,
> simply because you say so.

This argument has been made on this mailing list and on
pgsql-performance many times, so I have been assuming that this is
settled mathematics. Admittedly, I don't have a citation for this...
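
For illustration only, here is a rough simulation of the underlying
problem (the synthetic data and the naive scale-up estimator are
assumptions for the example, not what ANALYZE actually does; ANALYZE
uses a Haas/Stokes-style estimator, which runs into the same issue on
this kind of distribution):

# A 1% row sample cannot distinguish a column where every value
# appears a handful of times from one where every value is unique.
import random

N = 1_000_000
col_dups   = [i // 5 for i in range(N)]   # 200,000 distinct values, each x5
col_unique = list(range(N))               # 1,000,000 distinct values

for name, col in (("each value x5", col_dups), ("all unique", col_unique)):
    sample = random.sample(col, N // 100)   # 1% row sample
    d = len(set(sample))                    # distinct values seen in the sample
    print(f"{name:15s} true={len(set(col)):>9,} "
          f"sample_distinct={d:>6,} naive_scaleup={d * 100:>9,}")

Both columns yield roughly 10,000 distinct values in the sample, even
though their true n_distinct values differ by a factor of five, so no
estimator working only from that sample can get both cases right.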

> I would agree that it's impossible to get a decent estimate of
> n-distinct from a 1% sample.  But there's a huge difference between 5%
> or 10% and "a majority of the table".

Considering the way that disks work, it's not that huge. If the table
is small enough to fit in memory conveniently, it probably wouldn't be
unacceptably costly even to read the whole thing. But if it's a
terabyte on disk, reading every tenth block is probably going to carry
most of the I/O cost of reading every block. Even reading every
twentieth or hundredth block is going to be significantly more
expensive than what we do now.
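
As a back-of-envelope sketch (with made-up but plausible single-spindle
numbers; the constants below are assumptions, not measurements):

# Rough I/O cost model: full sequential scan vs. scattered block reads.
TABLE_BYTES = 1 << 40            # 1 TB table
BLOCK_BYTES = 8192               # PostgreSQL block size
SEQ_MBPS    = 150.0              # assumed sequential read throughput
RANDOM_IOPS = 150.0              # assumed scattered 8kB reads/sec (seek-bound)

n_blocks    = TABLE_BYTES // BLOCK_BYTES
full_scan_s = TABLE_BYTES / (SEQ_MBPS * 1024 * 1024)
print(f"full sequential scan: ~{full_scan_s / 3600:.1f} h")

for frac in (0.10, 0.05, 0.01):
    scattered_s = (n_blocks * frac) / RANDOM_IOPS   # worst case: pure random I/O
    print(f"{frac:4.0%} of blocks, scattered: ~{scattered_s / 3600:.1f} h "
          f"({scattered_s / full_scan_s:.1f}x the full scan)")

With those numbers, even randomly reading 1% of the blocks costs about
as much as one full sequential pass; real behavior with readahead and
skip-sequential access falls somewhere between the two bounds, but the
sampled fraction doesn't translate into a proportional I/O saving.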

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
