Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date: 2007-11-09 16:47:31
Message-ID: 47348F23.7070002@students.mimuw.edu.pl
Lists: pgsql-hackers pgsql-patches

> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?

Because the ispell-based dictionaries first stem the lexeme and then
search for it in the stopwords file. The situation here is that a
stopword is first stemmed to produce another lexeme (which is not in the
stopwords file, as it's a perfectly valid word) and then gets indexed,
instead of being discarded.
To restate: the word 'od' in Polish is both a preposition and a declined
form of the noun 'oda'. When the ispell dictionary is passed the lexeme
'od', it first stems it to produce 'oda' and then fails to find 'oda' in
the stopwords file. If I added 'oda' to the stopwords file, I would lose
the information that the noun 'oda' appears in documents.
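
To make the ordering problem concrete, here is a toy sketch in Python. It
is not the ispell dictionary code: the stemmer is a hypothetical one-entry
lookup table and the function names are invented for illustration. It only
shows how checking the stopword list after stemming lets 'od' through,
while checking before stemming discards it.

```python
# Toy model only: the real ispell dictionary uses affix files, not a lookup table.
STOPWORDS = {"od"}          # 'od' as a preposition should be discarded
TOY_STEMS = {"od": "oda"}   # hypothetical stemmer: 'od' is also a declined form of 'oda'


def stem(word):
    """Return the toy base form of a word, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)


def stem_then_filter(word):
    """Order described above: stem first, then check the stopword list."""
    lexeme = stem(word)
    return [] if lexeme in STOPWORDS else [lexeme]


def filter_then_stem(word):
    """Alternative order: discard stopwords before any stemming happens."""
    if word in STOPWORDS:
        return []
    return [stem(word)]


print(stem_then_filter("od"))   # the stopword slips through as 'oda'
print(filter_then_stem("od"))   # the stopword is discarded
print(filter_then_stem("oda"))  # the noun itself is still indexed
```

A stopword-only dictionary placed in front of pl_ispell would play the
role of the check in filter_then_stem.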

I'm still trying to find an English example, as I'm sure it would be
easier for most readers of this list to understand. Nothing comes to
mind, however. I guess some languages just have rotten luck with their
grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I never thought of that. You may well be right; it does look like an
easier solution. However, it would require coding some basic parsing
logic into the dex_init procedure, because right now the 'simple'
dictionary expects dict_initoption to be a path to the stopwords file.
Do you mean something like 'StopFile="/path/to/stopwords",
AcceptUnknown=0'?
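
As a rough illustration of the parsing that would be needed, here is a
Python sketch that handles the option string above. Everything in it is an
assumption made for illustration: the option names, the comma-separated
key=value syntax, and the helper names parse_init_option and simple_lexize
are hypothetical, and the real change would be C code in dex_init. The
three return values mirror the usual dictionary convention: an empty
result for a stopword, the word itself when it is accepted, and nothing at
all when the word should be passed on to the next dictionary.

```python
import re

# Hypothetical option syntax taken from the example string above;
# this is not how dex_init currently parses dict_initoption.
OPTION_RE = re.compile(r'\s*(\w+)\s*=\s*(?:"([^"]*)"|([^,\s]+))\s*(?:,|$)')


def parse_init_option(text):
    """Split 'Key="value", Key2=value2' into a dict with lowercased keys."""
    options = {}
    pos = 0
    while pos < len(text):
        match = OPTION_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse option string at offset {pos}")
        key, quoted, bare = match.groups()
        options[key.lower()] = quoted if quoted is not None else bare
        pos = match.end()
    return options


def simple_lexize(word, stopwords, accept_unknown):
    """Return [] for a stopword, [word] if accepted, None to pass the word on."""
    if word in stopwords:
        return []
    return [word] if accept_unknown else None


opts = parse_init_option('StopFile="/path/to/stopwords", AcceptUnknown=0')
accept = opts.get("acceptunknown", "1") != "0"  # assumed default: accept
stopwords = {"od"}  # stands in for the contents of the StopFile

print(opts)
print(simple_lexize("od", stopwords, accept))   # stopword: discarded
print(simple_lexize("oda", stopwords, accept))  # unknown: passed on
```

With AcceptUnknown=0 the dictionary only removes stopwords and leaves
every other word for the next dictionary in the chain, such as pl_ispell.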

Regards,
Jan Urbanski
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
