Quick Links

Re: tsvector pg_stats seems quite a bit off.

From:	Jan Urbański <wulczer(at)wulczer(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsvector pg_stats seems quite a bit off.
Date:	2010-05-28 21:47:28
Message-ID:	4C0039F0.2080406@wulczer.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 28/05/10 22:22, Tom Lane wrote:
> The idea that I was toying with is to assume a Zipfian distribution of
> the input (with some reasonable parameter), and use that to estimate
> what the frequency of the K'th element will be, where K is the target
> number of MCV entries or perhaps a bit more. Then use that estimate as
> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
> per the paper. If the eventual filtering results in a lot less than the
> target number of MCV entries (because the input wasn't so Zipfian), we
> lose, but at least we have accurate numbers for the entries we kept.

I see what you mean, so the idea would be:

* assume some value of W as the number of all words in the language
* estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
* set e = s/10 and w = 1/e, that is 10/s
* perform LC using that value of w
* remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
* create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154 that is f < 291201.37

I wonder what would happen if Jasper's case if we did that... And I
wonder how sound that maths is.

>> I we should definitely prune the table one last time in the very
>> probable case that the loop ended in the middle of two regularly
>> happening prunes.
>
> The paper doesn't say that you need to do that. I suspect if you work
> through the math, you'll find out that the minimum-f filter skips
> anything that would have been pruned anyway.

Possible.

Cheers,
Jan

In response to

Re: tsvector pg_stats seems quite a bit off. at 2010-05-28 20:22:25 from Tom Lane

Responses

Re: tsvector pg_stats seems quite a bit off. at 2010-05-29 10:34:43 from Jesper Krogh
Re: tsvector pg_stats seems quite a bit off. at 2010-05-29 15:09:13 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2010-05-28 22:39:25	Re: psql's is_select_command is naive
Previous Message	Martijn van Oosterhout	2010-05-28 21:23:14	Re: How to pass around collation information