Re: integrated tsearch doesn't work with non utf8 database

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: integrated tsearch doesn't work with non utf8 database
Date: 2007-09-08 06:19:38
Message-ID: Pine.LNX.4.64.0709081017260.2767@sn.sai.msu.ru
Lists: pgsql-hackers

On Fri, 7 Sep 2007, Heikki Linnakangas wrote:

> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT: SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT: SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains character 'å' which doesn't
> have an equivalent in LATIN2.
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> latin2=# \i foo.sql
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=2.80..1134.45 rows=80 width=100)
>   Hash Cond: (parse.tokid = tt.tokid)
>   InitPlan
>     -> Seq Scan on pg_ts_config (cost=0.00..1.20 rows=1 width=4)
>          Filter: (oid = 3748::oid)
>     -> Seq Scan on pg_ts_config (cost=0.00..1.20 rows=1 width=4)
>          Filter: (oid = 3748::oid)
>   -> Function Scan on ts_parse parse (cost=0.00..12.50 rows=1000 width=36)
>   -> Hash (cost=0.20..0.20 rows=16 width=68)
>        -> Function Scan on ts_token_type tt (cost=0.00..0.20 rows=16 width=68)
>   SubPlan
>     -> Limit (cost=7.33..7.36 rows=1 width=36)
>          -> Subquery Scan dl (cost=7.33..7.36 rows=1 width=36)
>               -> Sort (cost=7.33..7.34 rows=1 width=8)
>                    Sort Key: m.mapseqno
>                    -> Seq Scan on pg_ts_config_map m (cost=0.00..7.32 rows=1 width=8)
>                         Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
>     -> Sort (cost=6.57..6.57 rows=1 width=8)
>          Sort Key: m.mapseqno
>          -> Seq Scan on pg_ts_config_map m (cost=0.00..6.56 rows=1 width=8)
>               Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
> (21 rows)
>
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
> We could rewrite ts_debug as a C-function, so that it doesn't try to

ts_debug currently doesn't work well with thesaurus dictionaries, so it
certainly needs to be rewritten in C. We have left that rewrite for the future.

> access any unnecessary dictionaries. It seems wrong to install
> dictionaries in databases where they won't work in the first place, but
> I don't see an easy fix for that. Any comments or better ideas?
>
>
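
The encoding side of the failure can be checked outside the database. The
sketch below is only an illustration in Python, not part of tsearch: it decodes
the bytes 0xc3 0xa5 from the error message and tests whether the resulting
character can be encoded in ISO 8859-2. It assumes Python's "iso8859_2" codec
covers the same repertoire as PostgreSQL's LATIN2; the helper name
representable() is made up for this example.

```python
# The UTF-8 byte sequence reported in the error message: 0xc3 0xa5.
raw = bytes([0xC3, 0xA5])
ch = raw.decode("utf-8")  # U+00E5, LATIN SMALL LETTER A WITH RING ABOVE


def representable(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: these Python codecs stand in for PostgreSQL's LATIN1 / LATIN2.
latin1_ok = representable(ch, "iso8859_1")
latin2_ok = representable(ch, "iso8859_2")

print(hex(ord(ch)), latin1_ok, latin2_ok)  # prints: 0xe5 True False
```

So the character exists in LATIN1 but not in LATIN2, which is consistent with
the stopword file loading in some encodings and failing in a latin2 database.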

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
