Re: tsvector pg_stats seems quite a bit off.

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 16:14:36
Message-ID: 4C013D6C.9030001@wulczer.org
Lists: pgsql-hackers

On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

freq(K) = 1 / (K * H(W)), where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and
W is the number of words in English

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) +
O(n^-6)

Assuming 1 million words in English and treating the 1/n and
higher-order correction terms in the harmonic expansion as negligible,
we get H(1e6) = 14.3927, which would make the frequency of the K'th
word 1/(14.3927 * K), that is 0.06948 / K (let's say 0.07 / K).

Which brings me to the same result as yours, which in turn reassures me
a lot ;) My previous result was wrong because I used the wrong logarithm
base, go figure.
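
Just to sanity-check the arithmetic, here's a trivial standalone C
snippet (purely scratch code, not meant for the patch) that redoes it:

#include <math.h>
#include <stdio.h>

/*
 * Asymptotic expansion of the n'th harmonic number, keeping terms up to
 * 1/(120*n^4):  H(n) ~ ln(n) + gamma + 1/(2n) - 1/(12n^2) + 1/(120n^4)
 */
static double
harmonic(double n)
{
    const double euler_gamma = 0.5772156649;

    return log(n) + euler_gamma
        + 1.0 / (2.0 * n)
        - 1.0 / (12.0 * n * n)
        + 1.0 / (120.0 * n * n * n * n);
}

int
main(void)
{
    double H = harmonic(1e6);   /* assume ~1 million distinct words */

    printf("H(1e6) = %.4f\n", H);                 /* ~14.3927 */
    printf("f(1)   = %.5f\n", 1.0 / H);           /* ~0.06948, i.e. ~0.07 */
    printf("f(100) = %.6f\n", 1.0 / (100.0 * H)); /* ~0.0007 once f(1) is rounded to 0.07 */
    return 0;
}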

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154*35017 lexemes in
the input, the bucket width and the final pruning value depend only on
the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 =
1428 lexemes and would not discard anything from the result. If we want
e to be s/2, we'd prune every 2857 lexemes and discard lexemes with
counts < 1887. For s/3, s/4 etc. the numbers look like this:

e      prune every  discard counts <
s/1    1428         0
s/2    2857         1887
s/3    4285         2516
s/4    5714         2831
s/5    7142         3019
s/6    8571         3145
s/7    10000        3235
s/8    11428        3302
s/9    12857        3355

s/2 or s/3 look reasonable.
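
For reference, a quick scratch program along these lines reproduces the
table above (again standalone C, not patch material; the 154*35017
input size is just the estimate from earlier in the thread):

#include <stdio.h>

int
main(void)
{
    const double s = 0.07 / 100;       /* predicted frequency of the 100th word */
    const double N = 154.0 * 35017.0;  /* assumed number of input lexemes */
    int          k;

    printf("e\tprune every\tdiscard counts <\n");
    for (k = 1; k <= 9; k++)
    {
        double e = s / k;              /* LC error bound, epsilon = s/k */

        /*
         * Bucket width: prune whenever this many lexemes have been read.
         * The small fudge guards against FP round-off at exact multiples
         * (e.g. s/7, where 1/e is exactly 10000).
         */
        int width = (int) (1.0 / e + 1e-6);

        /* Final pruning: discard lexemes seen fewer than (s - e) * N times. */
        int cutoff = (int) ((s - e) * N);

        printf("s/%d\t%d\t\t%d\n", k, width, cutoff);
    }
    return 0;
}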

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
