Re: Latin vs non-Latin words in text search parsing

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Latin vs non-Latin words in text search parsing
Date: 2007-10-22 10:31:59
Message-ID: 871wbn1luo.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword   Only ASCII letters
>>> nlword  Entirely letters per iswalpha(), but not lword
>>> word    Entirely alphanumeric per iswalnum(), but not nlword
>>>         (hence, includes at least one digit)
>> ...
>> I am not sure if there are any Western European languages where words can
>> only be formed with non-ASCII chars.
>
> There are at least two in Swedish: "ö" (island) and "å" (river). They're both a
> bit special because they're just one letter each.

For what it's worth I did the same search last night and found three French
words including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries such as Italian and Irish also have one-letter words like this.
The only other language with multi-letter words is actually Faroese, with "íð" and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.
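
To make the three categories concrete, here is a minimal sketch of Tom's
definitions in Python. It is only an illustration, not the parser's code: the
function name classify() is hypothetical, and Python's str.isalpha() and
str.isalnum() are used as stand-ins for iswalpha() and iswalnum(), which are
locale-dependent in C and may not agree with Python on every character.

```python
def classify(token: str) -> str:
    """Return the proposed class name for a token, or "other".

    Assumption: str.isalpha()/str.isalnum() approximate the C functions
    iswalpha()/iswalnum() named in the proposal.
    """
    if not token:
        return "other"
    # lword: only ASCII letters
    if token.isascii() and token.isalpha():
        return "lword"
    # nlword: entirely letters, but not lword (at least one non-ASCII letter)
    if token.isalpha():
        return "nlword"
    # word: entirely alphanumeric, but not nlword (so at least one digit)
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    for sample in ["island", "\u00f6", "\u00ed\u00f0", "abc123", "a-b"]:
        print(repr(sample), classify(sample))
```

Under these definitions a token mixing ASCII and non-ASCII letters falls into
"nlword", since it is entirely letters but not entirely ASCII.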

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure they use whitespace as simply as Latin languages do.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
