Re: Latin vs non-Latin words in text search parsing

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Latin vs non-Latin words in text search parsing
Date: 2007-10-22 08:58:03
Message-ID: 471C661B.2050803@enterprisedb.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Tom Lane wrote:
>
>> ISTM that perhaps a more generally useful definition would be
>>
>> lword Only ASCII letters
>> nlword Entirely letters per iswalpha(), but not lword
>> word Entirely alphanumeric per iswalnum(), but not nlword
>> (hence, includes at least one digit)
> ...
> I am not sure if there are any western european languages where words can
> only be formed with non-ascii chars.

There are at least two in Swedish: "ö" (island) and "å" (river). They're
both a bit special because they're just one letter each.

> lword Entirely letters per iswalpha, with at least one ASCII
> nlword Entirely letters per iswalpha
> word Entirely alphanumeric per iswalnum, but not nlword

I don't like this categorization much more than the original. The
distinction between lword and nlword is useless for most European
languages.
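
To make the difference between the two proposals concrete, here is a rough
sketch in Python. It is not the tsearch parser; classify_tom and
classify_alvaro are hypothetical names, and Python's str.isalpha() and
str.isalnum() only approximate the locale-dependent iswalpha() and
iswalnum().

```python
def is_ascii_letter(ch):
    """True for a-z and A-Z only."""
    return ch.isascii() and ch.isalpha()


def classify_tom(token):
    """Tom's proposal: lword is ASCII letters only."""
    if not token:
        return None
    if all(is_ascii_letter(c) for c in token):
        return "lword"
    if all(c.isalpha() for c in token):
        return "nlword"  # all letters, at least one non-ASCII
    if all(c.isalnum() for c in token):
        return "word"    # alphanumeric, hence at least one digit
    return None


def classify_alvaro(token):
    """Alvaro's proposal: lword is all letters with at least one ASCII."""
    if not token:
        return None
    if all(c.isalpha() for c in token):
        if any(c.isascii() for c in token):
            return "lword"
        return "nlword"  # all letters, none of them ASCII
    if all(c.isalnum() for c in token):
        return "word"
    return None


if __name__ == "__main__":
    for tok in ["island", "sjö", "ö", "abc123"]:
        print(tok, classify_tom(tok), classify_alvaro(tok))
```

Under these assumptions "sjö" (lake) is an nlword in Tom's scheme but an
lword in Alvaro's, while "ö" is an nlword in both. Either way ordinary
Swedish words end up split across two classes.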

I suppose that Tom's argument that it's useful to distinguish words made
of purely ASCII characters in computer-oriented stuff is valid, though I
can't immediately think of a use case. For things like parsing a
programming language, that's not really enough, so you'd probably end up
writing your own parser anyway. I'm also not clear what the use case for
the distinction between words with and without digits is. I don't think
there are any natural languages where a word can contain digits, so it
must be a computer-oriented thing as well.

I like the "aword" name more than "lword", BTW. If we change the meaning
of the classes, surely we can change the name as well, right?

Note that the default parser is useless for languages like Japanese,
where words are not separated by whitespace, anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
