Re: tsvector pg_stats seems quite a bit off.

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 20:04:07
Message-ID: 4C0021B7.2090604@wulczer.org
Lists: pgsql-hackers

On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
> When a user requests a list of items with threshold s, we output
> those entries in D where f >= (s-e)N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered
once, please verify my reasoning here.

We follow the algorithm as written; the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items with frequencies
higher than a threshold s. What we want to put in the statistics table
are items accompanied by their frequencies, so we need to do some
reasoning before we can construct the result.

Say we have an item I with count f (taken from our D structure), and
let the total number of entries be N. The question is: what is the
maximum threshold s that the user could specify and still make us
output this element? From

f >= (s - e) * N

we can say it's

s <= (f / N) + e

So item I will be output for any requested threshold up to (f / N) + e,
which means the corresponding value in the statistics entry should be
<I, (f / N) + e>.

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the difference
would not be dramatic.
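
To put a number on it, here is a minimal sketch comparing the two
stored values, assuming the default statistics target of 100 and
made-up counts (f = 25, N = 10000 are not taken from the bug report):

    #include <stdio.h>

    /* Compare the frequency we currently store, f/N, with the bound
     * derived above, f/N + e, where e = 1 / (stats_target * 10). */
    int main(void)
    {
        int     stats_target = 100;             /* the default */
        double  e = 1.0 / (stats_target * 10);  /* = 0.001 */
        double  f = 25.0, N = 10000.0;          /* hypothetical counts */

        printf("current:  %g\n", f / N);        /* prints 0.0025 */
        printf("proposed: %g\n", f / N + e);    /* prints 0.0035 */
        return 0;
    }

The correction just shifts every stored frequency up by the constant e.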

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their true frequency had
been bigger, they would have appeared earlier and survived the pruning,
as I understand it.
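
For reference, here is a rough C sketch of the bookkeeping described in
the paper (the types and names are mine, not the actual ts_typanalyze.c
code):

    typedef struct
    {
        const char *lexeme;     /* the tracked element */
        long        f;          /* count observed since insertion */
        long        delta;      /* maximum possible undercount */
    } LCEntry;

    /*
     * At each bucket boundary b, the paper prunes every entry with
     * f + delta <= b.  An element first seen in bucket b enters with
     * delta = b - 1, so a latecomer starts with a large delta and only
     * survives if its count keeps growing from that point on.
     */
    static void
    prune(LCEntry *d, int *n, long b)
    {
        int     i, j = 0;

        for (i = 0; i < *n; i++)
        {
            if (d[i].f + d[i].delta > b)
                d[j++] = d[i];  /* entry survives this boundary */
        }
        *n = j;
    }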

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I've got my maths right.
The last item in the list would not be returned if the requested
frequency were higher than the one associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in the terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time, in the very
probable case that the loop ended partway between two regularly
scheduled prunes.
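
Building on the prune() sketch above, the finalization could look
roughly like this (s, e and the final bucket counter b are assumed
inputs here, not values taken from the current code):

    static int
    finalize(LCEntry *d, int n, long b, double s, double e, long N)
    {
        int     i, j = 0;

        prune(d, &n, b);        /* catch a partially filled last bucket */

        for (i = 0; i < n; i++)
        {
            /* the paper's output rule: emit only f >= (s - e) * N */
            if (d[i].f >= (s - e) * N)
                d[j++] = d[i];
        }
        return j;               /* number of entries to keep */
    }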

As for selecting the algorithm parameters: we don't get to select s. We
do get to select e, but that's it. I have a feeling that our problems
are caused by the fact that the algorithm tries to answer the question
"which elements occur more frequently than X", while we actually want
the answer to "what's the frequency of each element". I've almost
convinced myself that the transformation between the answers to these
questions exists, but this trouble report keeps giving me doubts.

Cheers,
Jan
