Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date: 2007-11-09 16:47:31
Message-ID: 47348F23.7070002@students.mimuw.edu.pl
Lists: pgsql-hackers, pgsql-patches

> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?

Because the ispell-based dictionaries first stem the lexeme and only then
look it up in the stopwords file. The problem here is that a stopword is
first stemmed into another lexeme (which is not in the stopwords file,
because it is a perfectly valid word) and then gets indexed instead of
being discarded.

To restate: the word 'od' in Polish is both a preposition and a declined
form of the noun 'oda'. When the ispell dictionary is passed the lexeme
'od', it first stems it to 'oda' and then fails to find that in the
stopwords file. If I added 'oda' to the stopwords file, I would lose the
information that the noun 'oda' appears in documents.
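
To make the ordering issue concrete, here is a rough sketch in Python (not
the actual tsearch2 code). The stemming table, the stopword set and the
function names are all made up for illustration; the only thing it is
meant to show is that checking stopwords after stemming lets 'od' through,
while checking the raw token first discards it.

```python
# Toy illustration only: the stem table and stopword set are invented
# stand-ins for the real ispell affix rules and stopwords file.
STOPWORDS = {"od"}          # 'od' as a preposition should be discarded
STEMS = {"od": "oda"}       # hypothetical rule: 'od' is a declined form of 'oda'


def stem(word):
    """Return the base form, or the word itself if no rule applies."""
    return STEMS.get(word, word)


def ispell_like(word):
    """Stem first, then check the stopword list (the order described above)."""
    base = stem(word)
    return [] if base in STOPWORDS else [base]


def filter_then_ispell(word):
    """Check the stopword list on the raw token, then stem what is left."""
    if word in STOPWORDS:
        return []
    return [stem(word)]


print(ispell_like("od"))          # the stopword survives as 'oda'
print(filter_then_ispell("od"))   # the stopword is dropped
print(filter_then_ispell("oda"))  # the noun itself is still indexed
```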

I'm still trying to find an English example, as I'm sure it would be
easier for most readers of this list to understand. Nothing comes to
mind, however - I guess some languages just have rotten luck with their
grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I never thought of that. You may well be right - it does look like an
easier solution. However, it would require coding some basic parsing
logic into the dex_init procedure, because right now the 'simple'
dictionary expects dict_initoption to be a path to the stopwords file.
Do you mean something like 'StopFile="/path/to/stopwords",
AcceptUnknown=0'?
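
Just to sketch what I have in mind (in Python for brevity, the real thing
would be C inside dex_init): the option syntax, the regular expression and
both function names below are my assumptions, not existing tsearch2 code.
The second function shows the accept-or-pass-on behaviour, using an empty
list for a stopword and None for "unknown, pass it on to the next
dictionary".

```python
import re

# Hypothetical option syntax: key=value pairs separated by commas,
# where a value is either double-quoted or a bare token.
OPTION_RE = re.compile(r'\s*(\w+)\s*=\s*(?:"([^"]*)"|([^,\s]+))\s*(?:,|$)')


def parse_dict_options(text):
    """Parse 'Key="value", Other=0' into a dict with lower-cased keys."""
    options = {}
    pos = 0
    while pos < len(text):
        match = OPTION_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse options at offset {pos}: {text[pos:]!r}")
        key, quoted, bare = match.groups()
        options[key.lower()] = quoted if quoted is not None else bare
        pos = match.end()
    return options


def simple_lexize(word, stopwords, accept_unknown=True):
    """Return [] for a stopword, [word] if accepted, None to pass it on."""
    word = word.lower()
    if word in stopwords:
        return []
    return [word] if accept_unknown else None


opts = parse_dict_options('StopFile="/path/to/stopwords", AcceptUnknown=0')
accept = opts.get("acceptunknown", "1") != "0"
print(opts, accept)
```

With AcceptUnknown=0 such a dictionary would only ever swallow stopwords
and hand everything else to whatever dictionary comes next in the
configuration, which is the behaviour I was after.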

Regards,
Jan Urbanski
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
