Re: tsvector pg_stats seems quite a bit off.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 02:47:57
Message-ID: 16212.1275014877@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?

> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it. The paper says (last sentence in
the definition of the algorithm, section 4.2):

When a user requests a list of items with threshold s, we output
those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2. Since
e is fixed at 1/w, this effectively means that we are setting s to be
only fractionally greater than e, which means very large relative errors
in the estimates.
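
To spell it out with the paper's symbols: emitting every entry with
f >= 2 is the same as applying the output rule with

	(s - e) * N = 2,   i.e.   s = e + 2/N

and for any nontrivial N the 2/N term is negligible, so we are in
effect running the algorithm with s barely above e = 1/w. The whole
point of the scheme is to pick e well below s (something like e = s/10,
if memory serves), so that the guaranteed error e*N stays small
compared to the counts we actually report.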

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late. These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.
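
To put a number on "very large": an entry first created while we are
scanning bucket b gets delta = b - 1, and all the algorithm guarantees
is that the true count lies somewhere between f and f + delta. A lexeme
first seen in the last of the roughly N/w buckets therefore carries
delta of about N/w - 1, so a reported f of 2 could correspond to a true
count of anything up to about N/w + 1.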

The net effect of this is, first, that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, as
little as two occurrences, even though their true frequency could be
quite a bit higher. What's even worse is that we treat the minimum of
these claimed frequencies as a reliable upper bound on the frequencies
of items not listed in the MCV list. Both of these errors are manifest
in Jesper's description of his case, and they're also easy to reproduce
if you just take stats on any reasonable-size collection of documents.
Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list.

So I think we have to fix this. The right thing is to select s and e
values that are actually meaningful in the paper's terms, then compute
w from them, and be sure to enforce the minimum f value specified in
the algorithm, i.e., don't be afraid to throw away entries from the
final D table.
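
Just to be concrete about the shape of the fix, here is a throwaway,
self-contained sketch of the paper's algorithm with the output rule
enforced. The parameter choices (s = 0.01, e = s/10) and the toy input
are made up for illustration; choosing the real s and e for
ts_typanalyze is exactly the question we'd have to settle.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct LCEntry
{
	const char *key;			/* the lexeme */
	int			f;				/* count accumulated since the entry was created */
	int			delta;			/* maximum possible undercount */
	struct LCEntry *next;
} LCEntry;

int
main(void)
{
	double		s = 0.01;		/* support threshold we actually care about */
	double		e = s / 10;		/* error bound, deliberately well below s */
	int			w = (int) ceil(1.0 / e);	/* bucket width w = 1/e */

	/* toy input; in ANALYZE this would be the stream of lexemes */
	const char *stream[] = {"foo", "bar", "foo", "baz", "foo", "bar", "foo"};
	int			N = (int) (sizeof(stream) / sizeof(stream[0]));

	LCEntry    *table = NULL;
	int			b_current = 1;
	int			i;

	for (i = 0; i < N; i++)
	{
		LCEntry    *entry;

		/* bump the count if we are already tracking this item */
		for (entry = table; entry; entry = entry->next)
			if (strcmp(entry->key, stream[i]) == 0)
				break;

		if (entry)
			entry->f++;
		else
		{
			/* new entry: it may have occurred up to b_current - 1 times already */
			entry = (LCEntry *) malloc(sizeof(LCEntry));
			entry->key = stream[i];
			entry->f = 1;
			entry->delta = b_current - 1;
			entry->next = table;
			table = entry;
		}

		/* at each bucket boundary, prune entries that cannot be frequent */
		if ((i + 1) % w == 0)
		{
			LCEntry   **prev = &table;

			while ((entry = *prev) != NULL)
			{
				if (entry->f + entry->delta <= b_current)
				{
					*prev = entry->next;
					free(entry);
				}
				else
					prev = &entry->next;
			}
			b_current++;
		}
	}

	/*
	 * The output step is the part we currently get wrong: only entries with
	 * f >= (s - e) * N should be reported, not everything that merely
	 * survived with f >= 2.  (With this toy input N < w, so pruning never
	 * fires; real input needs N >> w for the bounds to mean anything.)
	 */
	{
		LCEntry    *entry;

		for (entry = table; entry; entry = entry->next)
			if (entry->f >= (s - e) * N)
				printf("%s\t%d\n", entry->key, entry->f);
	}

	return 0;
}

The part that matters is the last loop: once s and e are chosen
honestly, refusing to emit surviving entries with f < (s - e)*N is what
keeps the junk out of the MCV list.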

regards, tom lane
