Re: tsvector pg_stats seems quite a bit off.

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 13:56:57
Message-ID: 4C011D29.2070309@wulczer.org
Lists: pgsql-hackers

On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there: assuming the English
>> language has ~1e6 words, H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as H(W)/(st + 10) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop-words, users may
> use something where they are needed in the index or have a language
> other than the typical one (and they don't seem to influence the math that much).

It turns out it has a nearly linear influence on the bucket width and on
the frequency necessary to survive the final pruning. I put some data in
a spreadsheet; results below.

> Isn't it the same "type" of logic that is used for collecting statistics
> for "array-types", say integer arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>>
>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new datapoints? That would seem too often
> to me, since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we are saying that the K'th word in the
English language has a frequency of 0.06, so if we want statistics with
an error of e = s/10, we can prune every 1/e = 167 lexemes (K being the
statistics target, possibly plus the top stopwords).
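Spelling out that arithmetic (a throwaway sketch, not the real
ts_typanalyze code; all names are made up):

    # Back-of-the-envelope version of the numbers above; illustration only.
    H = 6.5              # assumed value of H(W) for ~1e6 English words
    stats_target = 100   # K
    top_stopwords = 10   # the "+ 10" from the earlier mail

    s = H / (stats_target + top_stopwords)  # ~0.059, rounded to 0.06 above
    e = s / 10                              # error bound e = s/10
    w = 1 / e                               # LC bucket width; with s rounded to 0.06, ~167

    print(round(s, 3), round(e, 4), round(w))  # 0.059 0.0059 169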

Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law talks about words in
documents in general. Normally a word like "the" would appear lots of
times in a document, but (even ignoring the fact that it's a stopword
and so won't appear at all) in a tsvector it will be present only once.
This may or may not be a problem; I'm not sure whether such "squashing"
of occurrences, as tsvectors do it, skews the distribution away from
Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.
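For reference, here is the Lossy Counting loop we are talking about, as a
minimal Python sketch (illustration only, not the C code in
ts_typanalyze.c; the names are invented):

    from math import ceil

    def lossy_count(lexemes, s, e):
        """Lossy Counting (Manku & Motwani): keep items with frequency >= s,
        with counts underestimated by at most e*N.  Illustration only."""
        w = ceil(1 / e)                 # bucket width: prune every w lexemes
        D = {}                          # lexeme -> (count, max_error)
        N = 0
        for lex in lexemes:
            N += 1
            b = ceil(N / w)             # current bucket number
            if lex in D:
                count, delta = D[lex]
                D[lex] = (count + 1, delta)
            else:
                D[lex] = (1, b - 1)
            if N % w == 0:              # end of bucket: prune the D structure
                D = {k: (c, d) for k, (c, d) in D.items() if c + d > b}
        # final step: discard everything below the (s - e) * N threshold
        return {k: c for k, (c, d) in D.items() if c >= (s - e) * N}

With e = 0.006 that prunes every 167 lexemes, and the final filter is the
f < 0.054*N rule quoted below.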

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we went
>> through 35017 rows (as it would be in Jesper's case, before he raised
>> the stats target from 100 to 1000), we will remove lexemes with
>> f < 0.054 * 35017 * 154, that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is
>>
>
> If it means that I would get an accurate MCE histogram for all
> things that have an occurrence in more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's what the spreadsheet spat out.

The variables are:
* the statistics target
* top stopwords
* error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think of it as a smudge factor
determining how much we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you wrote out all English words and their
frequencies (according to Zipf's law), the top Y of them would be
stopwords. So we want to discard words whose frequency is lower than that
of the (X + Y)'th word, and then we probably want some breathing space as
well. That cutoff frequency is called s in the LC algorithm.

Error factor determines the relation between s and e, since apparently
we want e to be proportional to s (e is the error bound from the LC
algorithm). It directly determines the bucket width; the larger the
bucket, the more accurate the results, since pruning happens less often.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. The tsvector length
and rows in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f < (s - e)N,
where N is the total number of lexemes seen.
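In code form, the spreadsheet just does the following (a rough Python
rendering with invented names; the defaults are the constants mentioned
above):

    def lc_parameters(stats_target, top_stopwords, error_factor,
                      H=6.5, tsvector_len=154, rows_in_sample=35017):
        """Bucket width and final pruning cutoff for a given set of knobs."""
        s = H / (stats_target + top_stopwords)  # assumed Zipfian cutoff frequency
        e = s / error_factor                    # LC error bound
        w = 1 / e                               # prune the D structure every w lexemes
        N = rows_in_sample * tsvector_len       # total lexemes seen in the sample
        cutoff = (s - e) * N                    # the f < (s - e)N rule
        return int(w), int(cutoff)

    print(lc_parameters(100, 100, 100))    # (3076, 173507)
    print(lc_parameters(1000, 100, 100))   # (16923, 31546)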

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose setting both top_stopwords and
error_factor to 100.

With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with fewer than 173507
occurrences. With the statistics target set to 1000, those numbers would
change to 16923 and 31546, respectively.

> I can "fairly easy" try out patches or do other kind of testing.

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Cheers,
Jan

Attachment: lc.csv (text/csv, 775 bytes)
