Re: Latin vs non-Latin words in text search parsing

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Latin vs non-Latin words in text search parsing
Date: 2007-10-21 21:59:53
Message-ID: 20071021215953.GA12111@alvh.no-ip.org
Lists: pgsql-hackers

Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
>
>     lword     Only ASCII letters
>     nlword    Entirely letters per iswalpha(), but not lword
>     word      Entirely alphanumeric per iswalnum(), but not nlword
>               (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories. I am not sure
I agree with this particular definition though. I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "latin word" category. This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars. At least in Spanish, accents tend
to be rare. However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 row)

I don't think this is much of a problem, since this particular word is
(most likely) a stopword.

So, how about

    lword     Entirely letters per iswalpha, with at least one ASCII
    nlword    Entirely letters per iswalpha
    word      Entirely alphanumeric per iswalnum, but not nlword
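
To make the difference between the two sets of definitions concrete, here is a
rough sketch in Python. It is not the parser's code. It assumes that Python's
str.isalpha() and str.isalnum() are acceptable stand-ins for iswalpha() and
iswalnum(), that accented letters arrive precomposed (NFC), and the function
names classify_earlier and classify_proposed are made up for illustration.

```python
def classify_earlier(token: str) -> str:
    """Quoted definition: lword means only ASCII letters."""
    if token.isalpha():              # stand-in for iswalpha() on every char
        return "lword" if token.isascii() else "nlword"
    if token.isalnum():              # stand-in for iswalnum() on every char
        return "word"                # alphanumeric, so at least one digit
    return "other"                   # not a word token under either scheme


def classify_proposed(token: str) -> str:
    """Proposal above: lword needs only one ASCII letter."""
    if token.isalpha():
        # Entirely letters: lword if any letter is ASCII, otherwise nlword.
        if any(ch.isascii() for ch in token):
            return "lword"
        return "nlword"
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    for tok in ["caracteres", "carácter", "añadido", "à", "abc123"]:
        print(tok, classify_earlier(tok), classify_proposed(tok))
```

Under these assumptions, 'carácter' and 'añadido' move from nlword to lword
with the second set of rules, while 'à' stays nlword in both.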

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
