Re: tsearch2: enable non ascii stop words with C locale

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsearch2: enable non ascii stop words with C locale
Date: 2007-02-12 14:55:11
Message-ID: 45D07FCF.7020407@sigaev.ru
Lists: pgsql-hackers

> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser, not in the
stop-word code. Second, p_islatin() is used to catch lexemes such as URLs or HTML
entities, so it is important that it identifies real Latin characters. And it works
correctly: it calls p_isalpha() (already patched for your case), then it calls
p_isascii(), which should be correct for any encoding with the C locale.
Third (and last):
contrib_regression=# show server_encoding;
server_encoding
-----------------
UTF8
contrib_regression=# show lc_ctype;
lc_ctype
----------
C
contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
lexize
--------
{}

Russian characters in UTF8 take two bytes each.
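
To make the two points above concrete, here is a minimal sketch in C. It is not the tsearch2 source: sketch_islatin() and count_non_ascii() are hypothetical names, and the sketch assumes a byte-wise check under the C locale, whereas the real parser functions may operate differently. It only illustrates why an "alphabetic and ASCII" test accepts plain Latin letters and rejects the bytes of a UTF8-encoded Cyrillic letter.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the check described above: a byte counts as
 * "latin" only if it lies in the 7-bit ASCII range and is alphabetic.
 * This mirrors the "p_isalpha, then p_isascii" idea, not the actual code.
 */
static int
sketch_islatin(unsigned char c)
{
    return c < 0x80 && isalpha(c) != 0;
}

/*
 * Count the bytes of a NUL-terminated string that fall outside ASCII.
 * Every byte of a multibyte UTF8 sequence has its high bit set, so each
 * one is counted.
 */
static size_t
count_non_ascii(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
    {
        if ((unsigned char) *s >= 0x80)
            n++;
    }
    return n;
}
```

For example, U+0438 (CYRILLIC SMALL LETTER I) is encoded in UTF8 as the two bytes 0xD0 0xB8. count_non_ascii("\xD0\xB8") counts both bytes, and sketch_islatin() rejects each of them, while it accepts an ordinary letter such as 'a'.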

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
