Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector

From: Dominik Giger <dominik(dot)giger(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector
Date: 2020-01-28 10:50:20
Message-ID: CAGFNN0Y1KP_tjeAvaHqYr6fR3kEngbQeAyFaj7wF+1NaUEUAqw@mail.gmail.com
Lists: pgsql-bugs

On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> > rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> > setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> > setweight(to_tsvector('simple', 'foo something'), 'A') as
> > doc2,
> > to_tsquery('simple', 'foo:* & something') as
> > query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct. What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two. That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2. But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights. At least on this example, doc1
> gets a higher score than doc2.
>
> regards, tom lane

I see, thank you for the explanation.

Maybe I can add another reason why I think it might be a bug. Consider
this query:

select ts_rank(doc1, query) as rank_wrong,
       ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as query) as subquery;

Here I only removed the '& something' part of the query. Now the query
behaves as one would expect: the rank of doc1 is higher than the rank
of doc2. I am unsure why adding a second search term (which is
contained in both documents) would change the ranking order.
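
To make the comparison easier to reproduce, here is a rough sketch that
puts both queries side by side and also includes ts_rank_cd(), as
suggested above. The names d, q and label and the column aliases are
only illustrative, and I am not quoting any output, since the exact
numbers should be checked on a real server:

```sql
-- Sketch only: same documents as above, ranked against both queries.
-- The aliases d, q, label and rank_* are illustrative names.
select q.label,
       ts_rank(d.doc1, q.query)    as rank_doc1,
       ts_rank(d.doc2, q.query)    as rank_doc2,
       ts_rank_cd(d.doc1, q.query) as rank_cd_doc1,
       ts_rank_cd(d.doc2, q.query) as rank_cd_doc2
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2) as d
cross join (values
             ('foo:* & something', to_tsquery('simple', 'foo:* & something')),
             ('foo:*',             to_tsquery('simple', 'foo:*'))
           ) as q(label, query);
```

Each result row corresponds to one query, so the change in ordering
between doc1 and doc2 can be read off by comparing the two rows.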

What do you think?

Regards,
Dominik Giger
