Re: Compound words giving undesirable results with tsearch2

From: Lars Haugseth <njus(at)larshaugseth(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Compound words giving undesirable results with tsearch2
Date: 2006-05-31 06:39:41
Message-ID: 878xoiemda.fsf@durin.larshaugseth.com
Lists: pgsql-general


* oleg(at)sai(dot)msu(dot)su (Oleg Bartunov) wrote:
|
| On Tue, 30 May 2006, Lars Haugseth wrote:
|
| > I've setup a database using tsearch2, configured with support for compound
| > words according to the excellent guide found here:
| >
| > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
| >
| > This works fine. There is however one drawback that I'd like to know
| > whether can be remedied. Let's say I want to search for records containing
| > the word 'fritekst', which is a compound Norwegian word meaning
| > 'free text'.
| >
| > testdb=# select to_tsquery('default_norwegian', 'fritekst');
| > to_tsquery
| > ------------------------------
| > 'fritekst' | 'fri' & 'tekst'
| > (1 row)
| >
| > Now, this will indeed match those records, but it will also match any
| > records containing both of the words 'fri' and 'tekst', without regard
| > to whether they are next to each other or in completely different parts
| > of the text being indexed. In many situations, this will lead to a lot
| > of 'false' matches, seen from a user perspective.
| >
| > Ideas on how to handle this problem will be much appreciated.
|
| this is where order by relevance should help.

Thank you for pointing me to this, I hadn't thought about that.

However, my first try with the rank_cd() function does not quite
produce the results I had expected:

SELECT set_curcfg('default_norwegian');

SELECT id, rank_cd(n, mytscol, to_tsquery('fritekst')) AS rank
FROM mytable
WHERE mytscol @@ to_tsquery('fritekst')
ORDER BY rank DESC;

No matter what value I use for n here, a record where the compound word
'fritekst' appears gets a rank of 0, whereas records where the words
'fri' and 'tekst' appear separately all get a rank > 0: the closer
together they are, the higher the rank.

If I try to set the value of n to 0, I still get a rank of 0 for a
record containing 'fritekst', and 1 for all records containing 'fri'
and 'tekst'.

When using the rank() function instead of rank_cd() in the query above,
records with the word 'fritekst' seem to score better, but I still get
higher ranks for some records containing the separate words and not the
compound word.
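
To make this easier to reproduce without my table, here is a minimal
sketch that ranks two made-up sample texts against the same query. It
assumes the same 'default_norwegian' configuration as above; the sample
texts, the column aliases and the value 4 for n are only illustrative
choices, not anything from my real data.

```sql
-- Sketch only: assumes tsearch2 is installed and 'default_norwegian' exists.
SELECT set_curcfg('default_norwegian');

-- Show how each sample text is indexed (lexemes and their positions).
SELECT to_tsvector('fritekst')     AS compound_doc,
       to_tsvector('fri og tekst') AS separate_doc;

-- Rank both sample texts against the same query with rank() and rank_cd().
-- The first argument to rank_cd() is n; 4 is an arbitrary choice here.
SELECT doc,
       rank(to_tsvector(doc), to_tsquery('fritekst'))       AS rank_score,
       rank_cd(4, to_tsvector(doc), to_tsquery('fritekst')) AS rank_cd_score
FROM (SELECT 'fritekst'::text AS doc
      UNION ALL
      SELECT 'fri og tekst'::text) AS samples;
```

Comparing the two rows should show whether the compound form or the
separate words are being favoured by each ranking function.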

--
Lars Haugseth

"If anyone disagrees with anything I say, I am quite prepared not only to
retract it, but also to deny under oath that I ever said it." -Tom Lehrer
