tsearch2 и unicode

From: "Evgeny M(dot) Baldin" <E(dot)M(dot)Baldin(at)inp(dot)nsk(dot)su>
To: pgsql-ru-general(at)postgresql(dot)org
Subject: tsearch2 и unicode
Date: 2007-01-03 08:01:10
Message-ID: Pine.LNX.4.64.0701031220490.9110@star.inp.nsk.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-ru-general

Добрый день

Преамбула: пытаюсь настроить русский полнотекстный поиск в MediaWiki
(wiki-движок который используется в Википедии) с использование в
качестве хранилища PostgreSQL версии 8.1.4

Амбула: MediaWiki пишет текст всегда в UTF8, поэтому, естественно
настроить tsearch2, чтобы работал с unicode. Для этого собрал и установил
tsearch2 из PostgreSQL 8.2 (tsearch2.82.tar.gz с официального сайта
tsearch2), который поддерживает UTF-8. Сделал initdb с ключиком
--locale=ru_RU.UTF-8

Возникло непонимание:

wikidb=# show lc_collate;
lc_collate
-------------
ru_RU.UTF-8
(1 row)

wikidb=# select * from pg_ts_cfg;
ts_name | prs_name | locale
-----------------+----------+--------------
default_russian | default | ru_RU.KOI8-R
utf8_russian | default | ru_RU.UTF-8
simple | default |
default | default | ru_RU.UTF-8
(4 rows)

Есть некое текстовое поле old_text в таблице mediawiki.pagecontent -
запись в UTF8, а клиент у меня в koi8, поэтому для отображения использую
convert

wikidb=# select convert(old_text,'UTF8','KOI8'),
to_tsvector(old_text) from mediawiki.pagecontent
where old_id=1500;

convert | to_tsvector
-------------+-------------
proba проба | 'proba':1
(1 row)

! В результате индексируется только латиница

Аналогичную процедуру сделал когда initdb запускалась с ключиком
--locale=ru_RU.KOI8-R

wikidb=# show lc_collate;
lc_collate
--------------
ru_RU.KOI8-R

wikidb=# select * from pg_ts_cfg;
ts_name | prs_name | locale
-----------------+----------+--------------
default_russian | default | ru_RU.KOI8-R
utf8_russian | default | ru_RU.UTF-8
simple | default |
default | default | ru_RU.KOI8-R
(4 rows)

wikidb=# select convert(old_text,'UTF8','KOI8'),
to_tsvector(convert(old_text,'UTF8', 'KOI8')) from
mediawiki.pagecontent where old_id=1499;
convert | to_tsvector
-------------+---------------------
проба proba | 'proba':2 'проба':1
(1 row)
------

То есть в первом случае русские слова не индексируются, а во втором
проблем нет. В каком месте разложены грабли?

С уважением
Евгений

P.S. С наступившим Новым Годом всех, естественно.
>From pgsql-ru-general-owner(at)postgresql(dot)org Wed Jan 3 08:46:46 2007
Received: from localhost (maia-4.hub.org [200.46.204.183])
by postgresql.org (Postfix) with ESMTP id 8BC9B9F95E3
for <pgsql-ru-general-postgresql(dot)org(at)postgresql(dot)org>; Wed, 3 Jan 2007 08:46:45 -0400 (AST)
Received: from postgresql.org ([200.46.204.71])
by localhost (mx1.hub.org [200.46.204.183]) (amavisd-new, port 10024)
with ESMTP id 53814-07 for <pgsql-ru-general-postgresql(dot)org(at)postgresql(dot)org>;
Wed, 3 Jan 2007 08:46:38 -0400 (AST)
X-Greylist: domain auto-whitelisted by SQLgrey-1.7.4
Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.171])
by postgresql.org (Postfix) with ESMTP id A93459F938C
for <pgsql-ru-general(at)postgresql(dot)org>; Wed, 3 Jan 2007 08:46:37 -0400 (AST)
Received: by ug-out-1314.google.com with SMTP id k3so1809240ugf
for <pgsql-ru-general(at)postgresql(dot)org>; Wed, 03 Jan 2007 04:46:35 -0800 (PST)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
s=beta; d=gmail.com;
h=received:message-id:date:from:reply-to:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references;
b=CobbzrlFH8n8Pbed7xkOfxi8yqRq6tklYK88osOSHbS/PCIFuk3rmrGzh4uXhL/ck/T9tT7rmSAHzEeJDcrfAzTWBWiORs7meKG49flaxv2EIO0hXhz6OFRx6uKxnw5tzo9hmCYa2f+CCQaGqjrIchsSP5rmx/+jNz4G8JooPBQ=
Received: by 10.78.166.7 with SMTP id o7mr2193305hue.1167828395539;
Wed, 03 Jan 2007 04:46:35 -0800 (PST)
Received: by 10.78.164.15 with HTTP; Wed, 3 Jan 2007 04:46:35 -0800 (PST)
Message-ID: <e431ff4c0701030446r6ee8d347kc62ac3113697e87c(at)mail(dot)gmail(dot)com>
Date: Wed, 3 Jan 2007 15:46:35 +0300
From: "Nikolay Samokhvalov" <samokhvalov(at)gmail(dot)com>
Reply-To: nikolay(at)samokhvalov(dot)com
To: "Evgeny M. Baldin" <E(dot)M(dot)Baldin(at)inp(dot)nsk(dot)su>
Subject: =?KOI8-R?Q?Re:__tsearch2_=C9_unicode?=
Cc: pgsql-ru-general(at)postgresql(dot)org
In-Reply-To: <Pine(dot)LNX(dot)4(dot)64(dot)0701031220490(dot)9110(at)star(dot)inp(dot)nsk(dot)su>
MIME-Version: 1.0
Content-Type: text/plain; charset=KOI8-R; format=flowed
Content-Transfer-Encoding: base64
Content-Disposition: inline
References: <Pine(dot)LNX(dot)4(dot)64(dot)0701031220490(dot)9110(at)star(dot)inp(dot)nsk(dot)su>
X-Virus-Scanned: Maia Mailguard 1.0.1
X-Archive-Number: 200701/2
X-Sequence-Number: 595

cGdfdHNfY2ZnIC0g1yD3wdvJyCDSwdPQz9LR1sXOycksIPfBzSDOwcTPINTPzNjLzyDQ0s/B0MTF
ytTJ1NggxaMgOi0pCtXLwdrB1yDUwcvJzSDPwtLB2s/NLCDLwcvB0SDSwcLP3sHRIMvPxMnSz9fL
wSDJ09DPzNja1cXU09EuCgr3z9Qg0NLJzcXSINPP2sTBzsnRINLBws/excogy8/OxsnH1dLBw8nJ
ICLTIM7VzNEiICjL09TB1MksIMvByyDSwdoKVVRGLTgpOiBodHRwOi8vd3d3LnNhaS5tc3Uuc3Uv
fm1lZ2VyYS93aWtpL3Bnd2ViX2NmZwoKT24gMS8zLzA3LCBFdmdlbnkgTS4gQmFsZGluIDxFLk0u
QmFsZGluQGlucC5uc2suc3U+IHdyb3RlOgo+IOTPwtLZyiDExc7YCj4KPiAgIPDSxcHNwtXMwTog
0NnUwcDT2CDOwdPU0s/J1Ngg0tXT08vJyiDQz8zOz9TFy9PUztnKINDPydPLINcgTWVkaWFXaWtp
Cj4gKHdpa2ktxNfJ1s/LIMvP1M/S2cogydPQz8zY2tXF1NPRINcg98nLydDFxMnJKSDTIMnT0M/M
2NrP18HOycUg1wo+IMvB3sXT1NfFIMjSwc7JzMndwSBQb3N0Z3JlU1FMINfF0tPJySA4LjEuNAo+
Cj4gICDhzcLVzME6IE1lZGlhV2lraSDQydvF1CDUxcvT1CDX08XHxMEg1yBVVEY4LCDQz9zUz83V
LCDF09TF09TXxc7Ozwo+IM7B09TSz8nU2CB0c2VhcmNoMiwg3tTPwtkg0sHCz9TBzCDTIHVuaWNv
ZGUuIOTM0SDc1M/HzyDTz8LSwcwgySDV09TBzs/XycwKPiB0c2VhcmNoMiDJ2iBQb3N0Z3JlU1FM
IDguMiAodHNlYXJjaDIuODIudGFyLmd6INMgz8bJw8nBzNjOz8fPINPBytTBCj4gdHNlYXJjaDIp
LCDLz9TP0tnKINDPxMTF0tbJ18HF1CBVVEYtOC4g88TFzMHMIGluaXRkYiDTIMvMwN7Jy8/NCj4g
LS1sb2NhbGU9cnVfUlUuVVRGLTgKPgo+ICAg98/azsnLzM8gzsXQz87JzcHOycU6Cj4KPiB3aWtp
ZGI9IyBzaG93IGxjX2NvbGxhdGU7Cj4gICBsY19jb2xsYXRlCj4gLS0tLS0tLS0tLS0tLQo+ICAg
cnVfUlUuVVRGLTgKPiAoMSByb3cpCj4KPiB3aWtpZGI9IyBzZWxlY3QgKiBmcm9tIHBnX3RzX2Nm
ZzsKPiB0c19uYW1lICAgICB8IHByc19uYW1lIHwgICAgbG9jYWxlCj4gLS0tLS0tLS0tLS0tLS0t
LS0rLS0tLS0tLS0tLSstLS0tLS0tLS0tLS0tLQo+ICAgZGVmYXVsdF9ydXNzaWFuIHwgZGVmYXVs
dCAgfCBydV9SVS5LT0k4LVIKPiAgIHV0ZjhfcnVzc2lhbiAgICB8IGRlZmF1bHQgIHwgcnVfUlUu
VVRGLTgKPiAgIHNpbXBsZSAgICAgICAgICB8IGRlZmF1bHQgIHwKPiAgIGRlZmF1bHQgICAgICAg
ICB8IGRlZmF1bHQgIHwgcnVfUlUuVVRGLTgKPiAoNCByb3dzKQo+Cj4KPiDl09TYIM7Fy8/FINTF
y9PUz9fPxSDQz8zFIG9sZF90ZXh0INcg1MHCzMnDxSBtZWRpYXdpa2kucGFnZWNvbnRlbnQgLQo+
INrB0MnT2CDXIFVURjgsIMEgy8zJxc7UINUgzcXO0SDXIGtvaTgsINDP3NTPzdUgxMzRIM/Uz8LS
wdbFzsnRIMnT0M/M2NrVwAo+IGNvbnZlcnQKPgo+IHdpa2lkYj0jIHNlbGVjdCBjb252ZXJ0KG9s
ZF90ZXh0LCdVVEY4JywnS09JOCcpLAo+ICAgIHRvX3RzdmVjdG9yKG9sZF90ZXh0KSBmcm9tIG1l
ZGlhd2lraS5wYWdlY29udGVudAo+ICAgIHdoZXJlIG9sZF9pZD0xNTAwOwo+Cj4gICAgIGNvbnZl
cnQgICB8IHRvX3RzdmVjdG9yCj4gLS0tLS0tLS0tLS0tLSstLS0tLS0tLS0tLS0tCj4gICBwcm9i
YSDQ0s/CwSB8ICdwcm9iYSc6MQo+ICgxIHJvdykKPgo+ICEg9yDSxdrVzNjUwdTFIMnOxMXL08nS
1cXU09Eg1M/M2MvPIMzB1MnOycPBCj4KPiDhzsHMz8fJ3s7VwCDQ0s/DxcTV0tUg08TFzMHMIMvP
x8TBIGluaXRkYiDawdDV08vBzMHT2CDTIMvMwN7Jy8/NCj4gLS1sb2NhbGU9cnVfUlUuS09JOC1S
Cj4KPiB3aWtpZGI9IyBzaG93IGxjX2NvbGxhdGU7Cj4gICAgbGNfY29sbGF0ZQo+IC0tLS0tLS0t
LS0tLS0tCj4gICBydV9SVS5LT0k4LVIKPgo+IHdpa2lkYj0jIHNlbGVjdCAqIGZyb20gcGdfdHNf
Y2ZnOwo+IHRzX25hbWUgICAgIHwgcHJzX25hbWUgfCAgICBsb2NhbGUKPiAtLS0tLS0tLS0tLS0t
LS0tLSstLS0tLS0tLS0tKy0tLS0tLS0tLS0tLS0tCj4gICBkZWZhdWx0X3J1c3NpYW4gfCBkZWZh
dWx0ICB8IHJ1X1JVLktPSTgtUgo+ICAgdXRmOF9ydXNzaWFuICAgIHwgZGVmYXVsdCAgfCBydV9S
VS5VVEYtOAo+ICAgc2ltcGxlICAgICAgICAgIHwgZGVmYXVsdCAgfAo+ICAgZGVmYXVsdCAgICAg
ICAgIHwgZGVmYXVsdCAgfCBydV9SVS5LT0k4LVIKPiAoNCByb3dzKQo+Cj4KPiB3aWtpZGI9IyBz
ZWxlY3QgY29udmVydChvbGRfdGV4dCwnVVRGOCcsJ0tPSTgnKSwKPiAgICB0b190c3ZlY3Rvcihj
b252ZXJ0KG9sZF90ZXh0LCdVVEY4JywgJ0tPSTgnKSkgZnJvbQo+ICAgIG1lZGlhd2lraS5wYWdl
Y29udGVudCB3aGVyZSBvbGRfaWQ9MTQ5OTsKPiAgICAgY29udmVydCAgIHwgICAgIHRvX3RzdmVj
dG9yCj4gLS0tLS0tLS0tLS0tLSstLS0tLS0tLS0tLS0tLS0tLS0tLS0KPiAgINDSz8LBIHByb2Jh
IHwgJ3Byb2JhJzoyICfQ0s/CwSc6MQo+ICgxIHJvdykKPiAgICAgICAgICAgICAgICAgICAgICAg
ICAgIC0tLS0tLQo+Cj4g9M8gxdPU2CDXINDF0tfPzSDTzNXewcUg0tXT08vJxSDTzM/XwSDOxSDJ
zsTFy9PJ0tXA1NPRLCDBINfPINfUz9LPzQo+INDSz8LMxc0gzsXULiD3IMvBy8/NIM3F09TFINLB
2szP1sXO2SDH0sHCzMk/Cj4KPiDzINXXwdbFzsnFzQo+ICAgICAgICAg5dfHxc7Jygo+Cj4gUC5T
LiDzIM7B09TV0MnX28nNIO7P19nNIOfPxM/NINfTxcgsIMXT1MXT1NfFzs7PLgo+IC0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLShlbmQgb2YgYnJvYWRjYXN0KS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLQo+IFRJUCA3OiBZb3UgY2FuIGhlbHAgc3VwcG9ydCB0aGUgUG9zdGdyZVNRTCBwcm9q
ZWN0IGJ5IGRvbmF0aW5nIGF0Cj4KPiAgICAgICAgICAgICAgICAgaHR0cDovL3d3dy5wb3N0Z3Jl
c3FsLm9yZy9hYm91dC9kb25hdGUKPgoKCi0tIApCZXN0IHJlZ2FyZHMsCk5pa29sYXkK

Responses

Browse pgsql-ru-general by date

  From Date Subject
Next Message Evgeny M. Baldin 2007-01-03 13:07:47 Re: tsearch2 и unicode
Previous Message Nikolay Samokhvalov 2006-12-15 15:41:47 Re: SELECT ... WHERE ... IN (SELECT ...) -> SELECT ... WHERE (... OR ... )