a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: pgsql-patches(at)postgresql(dot)org
Subject: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date: 2007-11-09 01:22:34
Message-ID: 4733B65A.9030707@students.mimuw.edu.pl
Lists: pgsql-hackers, pgsql-patches
Hi,

the rationale for this patch is rather complicated, as it's related to
the peculiarities of Polish grammar. Please read on.

I'm using PostgreSQL 8.2.4 and the ispell tsearch2 dictionary. The
problem is as follows. In Polish (and possibly in other languages that
don't come to mind at the moment) a noun can take different forms
depending on the grammatical context. This is called declension. For
example, the noun 'oda' (which means 'ode' in English) takes the form
'od' in certain cases. However, 'od' is also a Polish preposition. The
problem with the ispell dictionary is that it first reduces a lexeme to
its stem and only then checks whether the stem is a stopword.

This means that I either have to accept that the tsvectors for my
documents will contain large numbers of the noun 'oda' (because each
time the preposition 'od' is used in the text it will be stemmed to
'oda' and then indexed), or I have to include the word 'oda' in the
stopwords file and thus eliminate a perfectly good noun from my
tsvectors.

The solution I came up with was simple: write a dictionary that does
only one thing: it looks up the lexeme in a stopwords file and either
discards it or returns NULL. That way I could use it as the first
dictionary in the dictionary stack for the lexeme types I'm interested
in, and it would discard every instance of 'od' while passing every
non-stopword (in particular 'oda') on to the ispell dictionary.
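
To make the intended behaviour concrete, here is a minimal standalone
sketch in plain C. It is not the code from the patch and it does not use
the tsearch2 API: the names stop_dict, toy_ispell and run_stack, the
lex_result enum and the hard-coded stopword list are all made up for
illustration, and toy_ispell is only a stand-in that knows the single
'od'/'oda' case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stopword list; a real dictionary would read it from a file. */
static const char *const stopwords[] = { "od", "do", "na" };

/* What a dictionary can say about a lexeme (illustrative, not tsearch2 API). */
typedef enum {
    LEX_DISCARD,  /* stopword: drop it, do not consult further dictionaries */
    LEX_PASS,     /* not recognized ("returns NULL"): try the next dictionary */
    LEX_ACCEPT    /* recognized: *out holds the normalized form */
} lex_result;

typedef lex_result (*dict_fn)(const char *lexeme, const char **out);

static int is_stopword(const char *lexeme)
{
    size_t n = sizeof stopwords / sizeof stopwords[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, stopwords[i]) == 0)
            return 1;
    return 0;
}

/* The stop-only dictionary: checks the unstemmed lexeme, never accepts. */
static lex_result stop_dict(const char *lexeme, const char **out)
{
    (void) out;  /* this dictionary never produces a normalized form */
    return is_stopword(lexeme) ? LEX_DISCARD : LEX_PASS;
}

/* Toy stand-in for ispell: stems both 'od' and 'oda' to 'oda'. */
static lex_result toy_ispell(const char *lexeme, const char **out)
{
    if (strcmp(lexeme, "od") == 0 || strcmp(lexeme, "oda") == 0) {
        *out = "oda";
        return LEX_ACCEPT;
    }
    return LEX_PASS;
}

/* Walk the stack in order; return the form to index, or NULL if the
 * lexeme was discarded or no dictionary recognized it. */
static const char *run_stack(const dict_fn *stack, size_t n, const char *lexeme)
{
    assert(stack != NULL && lexeme != NULL);
    for (size_t i = 0; i < n; i++) {
        const char *out = NULL;
        lex_result r = stack[i](lexeme, &out);
        if (r == LEX_DISCARD)
            return NULL;
        if (r == LEX_ACCEPT)
            return out;
    }
    return NULL;
}
```

With the stack { stop_dict, toy_ispell }, the preposition 'od' is dropped
before any stemming happens, while 'oda' still reaches the stemmer and is
indexed. With toy_ispell alone, 'od' is stemmed to 'oda', which is the
unwanted behaviour described above.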

The attached patch adds a dictionary called stop to the set of standard
dictionaries that one gets after installing tsearch2. The C code may not
be first-class (however, it works for me in a real business solution) -
it's quite trivial and I'd be happy if some more experienced Postgres
hackers would implement the idea in a cleaner/safer way. It's been
tested on 8.2.4 and compiles on 8.2.5. I haven't even looked at the code
for 8.3 yet, but maybe the change could somehow make its way into the
integrated full text search?

Regards,
Jan Urbanski
Warsaw University
http://fiok.pl/

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
