Re: tsvector pg_stats seems quite a bit off.

From:	Jan Urbański <wulczer(at)wulczer(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsvector pg_stats seems quite a bit off.
Date:	2010-05-29 15:16:35
Message-ID:	4C012FD3.1070609@wulczer.org
Lists:	pgsql-hackers

On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Oops... hahaha, I computed this as 1/(st + 10)*H(W), when it should be
1/((st + 10)*H(W)). So s would be 1/(110*6.5) = 0.0014
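
To make the precedence slip concrete, here is a minimal Python sketch. It only restates the numbers from this thread (H(W) = 6.5, statistics target 100, 10 stopwords); the variable names are illustrative and not taken from any PostgreSQL source. It also reproduces the sanity check from the quoted reply: every word ranked above the cutoff is at least as frequent as the cutoff word, so rank * s is a lower bound on their combined frequency and must not exceed 1.

```python
# Values as quoted in the thread; names are illustrative assumptions.
H_W = 6.5            # H(W) as stated in the message for ~1e6 words
st = 100             # statistics target
top_stopwords = 10   # most common words assumed to be filtered out as stopwords
rank = st + top_stopwords

# Without the extra parentheses this evaluates as (1 / rank) * H_W.
s_wrong = 1 / rank * H_W
# Intended Zipf-style frequency of the word at the given rank.
s_right = 1 / (rank * H_W)

# Sanity check: the top `rank` words each occur at least as often as
# the word at `rank`, so their total frequency is at least rank * s.
lower_bound_wrong = rank * s_wrong   # far above 1.0, hence impossible
lower_bound_right = rank * s_right   # below 1.0, hence plausible

print(f"s_wrong = {s_wrong:.4f}, lower bound on total = {lower_bound_wrong:.2f}")
print(f"s_right = {s_right:.4f}, lower bound on total = {lower_bound_right:.2f}")
```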

With regard to my other mail, this means that top_stopwords = 10 and
error_factor = 10 would give bucket_width = 7150 and a final prune value
of 6787.
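
The bucket_width figure can be reproduced with a short Python sketch, under one assumption that this message does not spell out: that the error bound is epsilon = s / error_factor and the bucket width is 1/epsilon, as in the usual Lossy Counting setup. The formula for the final prune value is given in the other mail and is not reproduced here.

```python
# Values as quoted in the thread; names are illustrative assumptions.
H_W = 6.5            # H(W) as stated in the message
st = 100             # statistics target
top_stopwords = 10
error_factor = 10

# Corrected minimal frequency of interest.
s = 1 / ((st + top_stopwords) * H_W)

# Assumption: epsilon = s / error_factor and bucket width = 1 / epsilon.
epsilon = s / error_factor
bucket_width = round(1 / epsilon)

print(f"s = {s:.4f}, epsilon = {epsilon:.6f}, bucket_width = {bucket_width}")
```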

Jan
