Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

From: "Dave Held" <dave(dot)held(at)arrayservicesgrp(dot)com>
To: "pgsql-perform" <pgsql-performance(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Date: 2005-04-25 21:00:18
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184D8@asg002.asg.local
Lists: pgsql-hackers pgsql-performance

> -----Original Message-----
> From: Josh Berkus [mailto:josh(at)agliodbs(dot)com]
> Sent: Sunday, April 24, 2005 2:08 PM
> To: Andrew Dunstan
> Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> Actually, that paper looks *really* promising. Does anyone here have
> enough math to solve for D(sub)Md on page 6? I'd like to test it on
> samples of < 0.01%.
> [...]

D_Md = [1 - sqrt(f~_1 / s)] D_b + sqrt(f~_1 / s) D_B

s = block size

f~_1 = median frequency within blocks for distinct values occurring in
only one block

D_b = d + f_1^(b+1)

d = distinct classes in the sample

f_1^(b+1) = number of distinct values occurring in a single block in
a sample of b+1 blocks

D_B = d + [B / (b + 1)] f_1^(b+1)

b = sample size (in blocks)

B = total table size (in blocks)
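
To make the formula concrete, here is a plug-through with made-up
numbers (purely illustrative, not taken from any real table): suppose
B = 10000, b = 100, s = 100 (tuples per block), d = 1000,
f_1^(b+1) = 200, and f~_1 = 5. Then:

D_b  = 1000 + 200                 = 1200
D_B  = 1000 + (10000 / 101) * 200 ~ 20802
D_Md = (1 - sqrt(5 / 100)) * 1200 + sqrt(5 / 100) * 20802
     ~ 0.7764 * 1200 + 0.2236 * 20802
     ~ 5583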

f_k and f~_k are the only tricky functions here, but they are easy to
understand:

Suppose our column contains values from the set {a, b, c, ..., z}.
Suppose we have a sample of b = 10 blocks.
Suppose that the value 'c' occurs in exactly 3 blocks (we don't care
how often it occurs *within* those blocks).
Suppose that the value 'f' also occurs in exactly 3 blocks.
Suppose that the values 'h', 'p', and 'r' occur in exactly 3 blocks.
Suppose that no other value occurs in exactly 3 blocks.

f_3^b = 5

This is because there are 5 distinct values that occur in exactly
3 blocks. f_1^b is the number of distinct values that occur in
exactly 1 block, regardless of how often they occur within that block.

Note that when you select a sample size of b blocks, you actually
need to sample b+1 blocks to compute f_1^(b+1). This is largely
pedantry, since every occurrence of b in the formula is really b+1.
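
In code, computing f_k^b from a block sample is just a matter of
counting how many of the sampled blocks each distinct value shows up
in. A minimal sketch in Python (the input layout, a list of blocks
with each block a list of column values, is my own assumption for
illustration):

from collections import Counter

def block_counts(blocks):
    # For each distinct value, count the number of blocks it appears
    # in; repeated occurrences within one block count only once.
    appears_in = Counter()
    for block in blocks:
        for value in set(block):
            appears_in[value] += 1
    return appears_in

def f(k, blocks):
    # f_k: number of distinct values appearing in exactly k blocks.
    return sum(1 for n in block_counts(blocks).values() if n == k)

With the example above, f(3, blocks) returns 5, since exactly
{c, f, h, p, r} appear in 3 blocks.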

f~ is slightly trickier. First, we pick the distinct values that
occur in only one block. Then, we count how often each value
occurs within its block. To wit:

Suppose we have a set {d, q, y, z} of values that occur in only
one block.
Suppose that d occurs 3x, q occurs 1x, y occurs 8x, and z occurs 6x.

The function f-bar (the mean version) takes the mean of these counts
to determine the "cluster frequency". So f-bar here would be
(3 + 1 + 8 + 6) / 4 = 4.5. This allows one to compute D_MF.

The function f~ takes the median of these counts instead. Sorted,
the middle two values are 3 and 6, so the median is 3 or 6 (or 4.5,
if you average the two middle values).
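
As a sketch, the two "cluster frequency" statistics differ only in
whether you take the mean or the median of those within-block
counts. Continuing the Python sketch from above (it reuses
block_counts; the helper names are mine, not the paper's):

from collections import Counter
from statistics import mean, median

def single_block_frequencies(blocks):
    # Within-block occurrence counts for values seen in only one block.
    appears_in = block_counts(blocks)
    singles = {v for v, n in appears_in.items() if n == 1}
    counts = Counter(value for block in blocks
                     for value in block if value in singles)
    return list(counts.values())

def f_bar(blocks):
    # Mean of the counts: the statistic behind D_MF.
    return mean(single_block_frequencies(blocks))

def f_tilde(blocks):
    # Median of the counts: the statistic behind D_Md.
    return median(single_block_frequencies(blocks))

For the example counts {d: 3, q: 1, y: 8, z: 6}, both come out to
4.5 (the median being the average of the middle values 3 and 6).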

No tricky math involved. That should be enough to tell you how to
write the estimator.
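
Putting the pieces together, a sketch of the whole estimator might
look like the following (the parameter names are mine, and it is
just a straight reading of the formula above, not the paper's
reference code):

from math import sqrt

def estimate_distinct(d, f1, f1_tilde, b, B, s):
    # d        - distinct values seen in the sample
    # f1       - f_1^(b+1): distinct values seen in exactly one block
    # f1_tilde - median within-block frequency of those values
    # b        - sample size in blocks (b+1 blocks actually sampled)
    # B        - table size in blocks
    # s        - block size (tuples per block)
    D_b = d + f1                      # low estimate
    D_B = d + (B / (b + 1)) * f1      # high estimate
    w = sqrt(f1_tilde / s)            # interpolation weight, in [0, 1]
    return (1 - w) * D_b + w * D_B

With the made-up numbers from the plug-through above,
estimate_distinct(1000, 200, 5, 100, 10000, 100) comes out at
roughly 5583.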

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
