Re: Tsearch2 custom dictionaries

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: psql-mail(at)freeuk(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tsearch2 custom dictionaries
Date: 2003-08-07 17:13:39
Message-ID: Pine.GSO.4.56.0308072106070.17880@ra.sai.msu.su
Lists: pgsql-general

On Thu, 7 Aug 2003 psql-mail(at)freeuk(dot)com wrote:

> > On Thu, 7 Aug 2003 psql-mail(at)freeuk(dot)com wrote:
> >
> > > Part1.
> > >
> > > I have created a dictionary called 'webwords' which checks all words
> > > and curtails them to 300 chars (for now)
> > >
> > > after running
> > > make
> > > make install
> > >
> > > I then copied the lib_webwords.so into my $libdir
> > >
> > > I have run
> > >
> > > psql mybd < dict_webwords.sql
> > >
> > Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
> > Test it :
> > select lexize('webwords','some_web_word');
>
> I did test it with
> select lexize('webwords','some_web_word');
> lexize
> -------
> {some_web_word}
>
> select lexize('webwords','some_400char_web_word');
> lexize
> --------
> {some_shortened_web_word}
>
>
> so that bit works, but then I tried
>
> SELECT to_tsvector( 'webwords', 'my words' );
> Error: No tsearch config

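From the reference guide:
to_tsvector( [configuration,] document TEXT) RETURNS tsvector

The optional first argument is the name of a configuration, not of a
dictionary, so 'webwords' is not found there.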

>
> > Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
>
> Yeah, I did read it - it's good!
> Should I run:
> update pg_ts_cfgmap set dict_name='{webwords}';
>

After loading your dictionary into the database you should have it
registered in pg_ts_dict. Try

select * from pg_ts_dict;

Next, you need to read the docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
on how to create your own configuration and specify the mapping from
lexeme types to dictionaries.
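
The steps described above might look roughly like the following. This is a
minimal sketch, not taken from the docs verbatim: the configuration name 'web'
is hypothetical, the column names of pg_ts_cfg and pg_ts_cfgmap are assumed
from the tsearch2 catalog tables, and it assumes the stock 'default'
configuration and 'default' parser are installed. Check the intro document
linked above before running it.

```sql
-- Sketch only: 'web' is a hypothetical configuration name, and the column
-- names below are assumptions about the tsearch2 catalog tables.

-- 1. Confirm that the dictionary is registered.
SELECT dict_name FROM pg_ts_dict WHERE dict_name = 'webwords';

-- 2. Create a configuration that uses the default parser.
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('web', 'default', NULL);

-- 3. Map every token type known to the 'default' configuration
--    to the webwords dictionary.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
SELECT 'web', tok_alias, '{webwords}'::text[]
FROM pg_ts_cfgmap
WHERE ts_name = 'default';

-- 4. Pass the configuration name, not the dictionary name, to to_tsvector.
SELECT to_tsvector('web', 'my words');
```

Copying the token types from an existing configuration avoids listing each
one by hand; a narrower mapping (for example only word and host tokens) may
be preferable if webwords should not see every token type.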

>
>
> > > Part2.
> <snip>
> > > As the text can be multilingual I don't think stemming is possible?
>
> >
> > You're right. I'm afraid you need a UTF database, but tsearch2 isn't
> > UTF-8 compatible :(
>
> My database was created as unicode - does this mean I cannot use
> tsearch?!
>

We have no experience with UTF, so you had better ask on the openfts mailing
list and read its archives.

> > > I also need to include many non-standard words in the index such as
> > > urls and message IDs contained in the text.
> > >
> >
> > What's a message ID? An integer? That is already recognized by the parser.
> >
> > try
> > select * from token_type();
> >
> > Also, the latest version of tsearch2 (for 7.3 grab it from
> > http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
> > for 7.4 it is available from CVS)
> > has a rather useful function - ts_debug
> >
> > apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
> >  ts_name | tok_type | description |     token      | dict_name |     tsvector
> > ---------+----------+-------------+----------------+-----------+------------------
> >  simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
> >  simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
> > (2 rows)
> >
> >
> >
> > > I get the feeling that building these indexes will by no means be an
> > > easy task so any suggestions will be gratefully received!
> > >
> >
> > As a last resort, you may write your own parser. Some info about the parser API:
> > http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
>
>
> Parser writing...scary stuff :-)
>
>
> Thanks!
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
