Re: Latin vs non-Latin words in text search parsing

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: heikki(at)enterprisedb(dot)com
Cc: alvherre(at)commandprompt(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latin vs non-Latin words in text search parsing
Date: 2007-10-22 09:09:47
Message-ID: 20071022.180947.55724535.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> Alvaro Herrera wrote:
> > Tom Lane wrote:
> >
> >> ISTM that perhaps a more generally useful definition would be
> >>
> >> lword   Only ASCII letters
> >> nlword  Entirely letters per iswalpha(), but not lword
> >> word    Entirely alphanumeric per iswalnum(), but not nlword
> >>         (hence, includes at least one digit)
> > ...
> > I am not sure if there are any western european languages where words can
> > only be formed with non-ascii chars.
>
> There are at least two in Swedish: "ö" (island) and "å" (river). They're
> both a bit special because they're just one letter each.
>
> > lword   Entirely letters per iswalpha, with at least one ASCII
> > nlword  Entirely letters per iswalpha
> > word    Entirely alphanumeric per iswalnum, but not nlword
>
> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.
>
> I suppose that Tom's argument that it's useful to distinguish words made
> of purely ASCII characters in computer-oriented stuff is valid, though I
> can't immediately think of a use case. For things like parsing a
> programming language, that's not really enough, so you'd probably end up
> writing your own parser anyway. I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there are any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.
>
> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?
>
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

The above is true, but that does not necessarily mean that Tsearch is not
used for Japanese at all. I work around the problem by running a
pre-processing step that separates Japanese sentences into words divided
by white space. I wish I could write a new parser that does this job
for 8.4 or later...

Please change the word definition very carefully.
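
To make the stakes concrete, here is a rough sketch of the three classes as Tom proposed them in the quoted text. It is only an illustration in Python, not the parser's actual code: the function name classify() is made up, Python's str.isalpha() and str.isalnum() stand in for iswalpha() and iswalnum() (they are Unicode-based rather than locale-dependent, so results may differ from the C functions), and "other" is a placeholder for anything outside the three classes.

```python
def classify(token: str) -> str:
    """Classify a token using the lword/nlword/word definition quoted above.

    Hypothetical helper for illustration only; str.isalpha()/str.isalnum()
    are assumed stand-ins for iswalpha()/iswalnum().
    """
    if not token:
        return "other"
    # lword: only ASCII letters
    if token.isascii() and token.isalpha():
        return "lword"
    # nlword: entirely letters, but not lword (so at least one non-ASCII letter)
    if token.isalpha():
        return "nlword"
    # word: entirely alphanumeric, but not nlword (so at least one digit)
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    for sample in ["island", "ö", "日本語", "beta2", "foo-bar"]:
        print(sample, classify(sample))
```

Under this definition a pre-segmented Japanese word such as "日本語" would be an nlword, so any change to what nlword means changes how such tokens get mapped to dictionaries.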
--
Tatsuo Ishii
SRA OSS, Inc. Japan
