Re: tsvector pg_stats seems quite a bit off.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 15:09:13
Message-ID: 19353.1275145753@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.
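
(A quick back-of-the-envelope check, sketched in Python just to make that
explicit -- the 0.06 figure is the one from the quoted estimate:)

    # If the 100'th most common word had frequency 0.06, the 99 words
    # ranked above it would each have frequency >= 0.06, so the total
    # would already be at least 100 * 0.06 = 6.0 -- far more than the
    # 1.0 that the frequencies of all words can sum to.
    rank = 100
    claimed_freq = 0.06
    lower_bound_on_total = rank * claimed_freq
    print(lower_bound_on_total)   # 6.0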

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law. The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.001. That's pretty crude
of course but it seems like the right ballpark.
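
(Again just a sketch, in Python, using the 0.1 top-lexeme frequency assumed
above and, for comparison, the ~1e6-word vocabulary from the quoted mail:)

    from math import log

    def zipf_freq(k, top_freq=0.1):
        # Rule of thumb: the k'th most common element occurs 1/k as often
        # as the most common one.
        return top_freq / k

    print(zipf_freq(100))    # 0.001 -- the ballpark figure above

    # For comparison, the fully normalized Zipfian frequency is
    # f(k) = 1 / (k * H(n)), where H(n) is the n'th harmonic number
    # (approximately ln(n) + 0.5772 for large n).
    n_words = 1_000_000
    H = log(n_words) + 0.5772156649
    print(1.0 / (100 * H))   # ~0.0007, same order of magnitude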

regards, tom lane
