Re: Tsearch2 Dutch snowball stemmer in PG8.1

From: Alban Hertroys <a(dot)hertroys(at)magproductions(dot)nl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Tsearch2 Dutch snowball stemmer in PG8.1
Date: 2007-10-03 13:35:52
Message-ID: 47039AB8.9010800@magproductions.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Oleg Bartunov wrote:
> Alban,
>
> the documentation you're refereed on is for upcoming 8.3 release.
> For 8.1 and 8.2 you need to do all machinery by hand. It's not
> difficult, for example:

Thanks Oleg.
I think I managed to do this right, although I had to google for some of
the files (we don't have ispell installed).

You also seem to have mixed russion and english dictionaries in your
example, I'm not sure that was on purpose?

Anyway, I changed your example to use dutch dictionaries and locale
where I thought it applicable, and I got something working apparently.
Quite some guess work was involved, so I have a few questions left.

The only odd thing is that to_tsvector('dutch', 'some dutch text') now
returns '|' for stop words...

For example:
select to_tsvector('nederlands', 'De beste stuurlui staan aan wal');
to_tsvector
------------------------------------------------
'|':1,5 'bes':2 'wal':6 'staan':4 'stuurlui':3

A minor nit... You ended the script with a hidden commit (END;). I would
have preferred to experiment with the results a bit before commiting...

I mixed in a few questions below, if you could answer them please?

> -- sample tsearch2 configuration for search.postgresql.org
> -- Creates configuration 'pg' - default, should match server's locale !!!
> -- Change 'ru_RU.UTF-8'
>
> begin;
>
> -- create special (default) configuration 'pg'
> update pg_ts_cfg set locale=NULL where locale = 'ru_RU.UTF-8';

I suppose this disables a possibly existing stemmer for that locale?

> insert into pg_ts_cfg values('pg','default','ru_RU.UTF8');
>
> -- register 'pg_dict' dictionary using synonym template
> -- postgres pg
> -- pgsql pg
> -- postgresql pg
> insert into pg_ts_dict
> (select 'pg_dict',dict_init,
> '/usr/local/pgsql-dev/share/contrib/pg_dict.txt',
> dict_lexize, 'pg-specific dictionary'
> from pg_ts_dict
> where dict_name='synonym'
> );
>
> -- register ispell dictionary, check paths and stop words
> -- I used iconv for english files, since there are some cyrillic stuff
> insert into pg_ts_dict
> (SELECT 'en_ispell', dict_init,
> 'DictFile="/usr/local/share/dicts/ispell/utf8/english-utf8.dict",'
> 'AffFile="/usr/local/share/dicts/ispell/utf8/english-utf8.aff",'
> 'StopFile="/usr/local/share/dicts/ispell/utf8/english-utf8.stop"',
> dict_lexize
> FROM pg_ts_dict
> WHERE dict_name = 'ispell_template'
> );

I actually use a .lat file here. I have no idea whether that's
compatible (but it appears to have worked).

I got my .lat and .aff files from:
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Dutch-dicts

My stop words file is from:
http://snowball.tartarus.org/algorithms/dutch/stop.txt

> -- use the same stop-word list as 'en_ispell' dictionary
> UPDATE pg_ts_dict set dict_initoption='/usr/local/share/dicts/english.stop'
> where dict_name='en_stem';

Why change the stop words for the English dictionary? I skipped this
step. Is that right?

> -- default token<->dicts mappings
> insert into pg_ts_cfgmap select 'pg', tok_alias, dict_name from
> public.pg_ts_cfgmap where ts_name='default';
>
> -- modify mappings for latin words for configuration 'pg'
> update pg_ts_cfgmap set dict_name = '{pg_dict,en_ispell,en_stem}'
> where tok_alias in ( 'lword', 'lhword', 'lpart_hword' )
> and ts_name = 'pg';
>
> -- we won't index/search some tokens
> update pg_ts_cfgmap set dict_name = NULL
> --where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float','word')
> where tok_alias in ('email', 'url', 'sfloat', 'uri', 'float')
> and ts_name = 'pg';
>
> end;
>
> -- testing
>
> select * from ts_debug('
> PostgreSQL, the highly scalable, SQL compliant, open source
> object-relational
> database management system, is now undergoing beta testing of the next
> version of our software: PostgreSQL 8.2.
> ');
>
>
> Oleg

--
Alban Hertroys
a(dot)hertroys(at)magproductions(dot)nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede

// Integrate Your World //

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alban Hertroys 2007-10-03 14:05:57 Re: Tsearch2 Dutch snowball stemmer in PG8.1
Previous Message Michael Glaesemann 2007-10-03 12:55:46 Re: Find min year and min value