Re: Fulltext search configuration

From: Mohamed <mohamed5432154321(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Fulltext search configuration
Date: 2009-02-02 16:09:19
Message-ID: 861fed220902020809u534b743atba3491397f27b404@mail.gmail.com
Lists: pgsql-general

On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:

> On Mon, 2 Feb 2009, Oleg Bartunov wrote:
>
> On Mon, 2 Feb 2009, Mohamed wrote:
>>
>> Hehe, ok..
>>> I don't know either but I took some lines from Al-Jazeera :
>>> http://aljazeera.net/portal
>>>
>>> just made the change you said and created it successfully and tried this
>>> :
>>>
>>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ????
>>> ?????????
>>> ?????')
>>>
>>> but I got nothing... :(
>>>
>>
>> Mohamed, what did you expect from ts_lexize ? Please, provide us valuable
>> information, else we can't help you.
>>
>
What I expected was something to be returned. After all, they are valid words
taken from an article (perhaps you don't see the words, only ???...).
Am I wrong to expect something? Should I set up the configuration
completely first?

SELECT ts_lexize('norwegian_ispell',
'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
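One thing that may explain the empty result (a guess on my part): ts_lexize expects a single token, not a whole phrase. Passing a full sentence makes the dictionary look up the entire string as one word, which it will never recognize. A minimal check, reusing the 'ayaspell' dictionary name from the earlier example and a placeholder word:

```sql
-- ts_lexize takes ONE token; a whole sentence is looked up as a
-- single (unknown) word. Test with a single word instead:
SELECT ts_lexize('ayaspell', 'some_single_word');
-- NULL           -> the word is unknown to the dictionary
-- {}             -> the word is a stop word
-- {lexeme, ...}  -> the word was recognized
```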

Check out this article if you need a sample.
http://www.aljazeera.net/NR/exeres/103CFC06-0195-47FD-A29F-2C84B5A15DD0.htm

>
>>
>>> Is there a way of making sure that words not recognized also gets
>>> indexed/searched for ? (Not that I think this is the problem)
>>>
>>
>> yes
>>
>
> Read
> http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
> "A text search configuration binds a parser together with a set of
> dictionaries to process the parser's output tokens. For each token type that
> the parser can return, a separate list of dictionaries is specified by the
> configuration. When a token of that type is found by the parser, each
> dictionary in the list is consulted in turn, until some dictionary
> recognizes it as a known word. If it is identified as a stop word, or if no
> dictionary recognizes the token, it will be discarded and not indexed or
> searched for. The general rule for configuring a list of dictionaries is to
> place first the most narrow, most specific dictionary, then the more general
> dictionaries,
> finishing with a very general dictionary, like a Snowball stemmer or
> simple, which recognizes everything."
>

Ok, but I don't have a thesaurus or a Snowball stemmer to fall back on. So
words that are valid but for some reason are not recognized "will be
discarded and not indexed or searched for", which I consider a problem since
I don't trust my configuration to cover everything.

Is this not a valid concern?
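One way around this, if I understand the quoted docs right (a sketch; it assumes the 'arabic' configuration created below and an ispell dictionary named ar_ispell already exist), would be to append the built-in 'simple' dictionary as the last, catch-all entry, so tokens nothing else recognizes are still indexed:

```sql
-- 'simple' accepts every token (lowercased), so placing it last in
-- the list means unrecognized words are indexed instead of discarded.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;
```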

>
> quick example:
>
> CREATE TEXT SEARCH CONFIGURATION arabic (
> COPY = english
> );
>
> =# \dF+ arabic
> Text search configuration "public.arabic"
> Parser: "pg_catalog.default"
> Token | Dictionaries
> -----------------+--------------
> asciihword | english_stem
> asciiword | english_stem
> email | simple
> file | simple
> float | simple
> host | simple
> hword | english_stem
> hword_asciipart | english_stem
> hword_numpart | simple
> hword_part | english_stem
> int | simple
> numhword | simple
> numword | simple
> sfloat | simple
> uint | simple
> url | simple
> url_path | simple
> version | simple
> word | english_stem
>
> Then you can alter this configuration.

Yes, I figured that's the next step, but I thought I should get ts_lexize to
work first? What do you think?

Just a thought, say I have this :

ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem;

Is it possible to keep adding dictionaries, to get both Arabic and English
matches on the same column (Arabic speakers tend to mix them), like this:

ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem, pg_english_dict, english_ispell,
english_stem;

Will something like that work ?
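If it does, one way to verify the chain (a sketch; it assumes the 'pg' configuration above was created) might be ts_debug, which shows, for each token in a sample text, which dictionary in the list finally recognized it:

```sql
-- ts_debug lists every token the parser produces, the dictionary
-- list consulted for its token type, and the dictionary that
-- actually matched (NULL means the token would be discarded).
SELECT token, dictionaries, dictionary, lexemes
FROM ts_debug('pg', 'some mixed arabic and english sample text');
```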

/ Moe
