Re: tsearch2 keep throw-away characters

From: "Ivan Zolotukhin" <ivan(dot)zolotukhin(at)gmail(dot)com>
To: Kimball <kbighorse(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: tsearch2 keep throw-away characters
Date: 2007-05-20 05:34:56
Message-ID: 751e56400705192234t33abf55s44e2f3aa7c6746ac@mail.gmail.com
Lists: pgsql-general

Hello,

Your problem is not about stop words. The tsearch parser treats the
'+' and '#' symbols as tokens of the blank type (use the ts_debug()
function to see this) and drops them without any further processing.
AFAIK, the typical solution is to rewrite your text, and then your
queries, to use auxiliary words like 'SYScpp' and 'SYScsharp', which
will be included in tsvectors and indexed without any problems.
Usually you can do the replacements in a tsvector trigger when
indexing documents, and via query rewriting (in tsearch or in your
application) when querying the database.

Trivial examples:

test=# select to_tsvector('english','I know how to code in SYScsharp,
java and SYScpp');
to_tsvector
------------------------------------------------------
'code':5 'java':8 'know':2 'syscpp':10 'syscsharp':7
(1 row)

and, sure:

test=# select 'I know how to code in SYScsharp, java and SYScpp' @@ 'SYScpp';
?column?
----------
t
(1 row)
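
If the replacement is done in the application rather than in a trigger,
it could look roughly like the sketch below. This is only an
illustration: the function name rewrite() and the REPLACEMENTS table are
made up for this example, and the SYS* words are simply the auxiliary
words used above. The same function has to be applied to both the
document text and the search query, so that the two sides agree.

```python
import re

# Hypothetical mapping from terms the parser would drop to auxiliary
# words that survive parsing. Extend it with your own terms.
REPLACEMENTS = {
    "c++": "SYScpp",
    "c#": "SYScsharp",
}

# Match a term only when it is not glued to other word characters,
# so that strings like "abc++" are left untouched.
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in REPLACEMENTS) + r")(?!\w)",
    re.IGNORECASE,
)


def rewrite(text: str) -> str:
    """Replace each known term with its auxiliary word, ignoring case."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)
```

The rewritten string is then what you pass to to_tsvector() when
indexing, and to your query function when searching.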

There might be a more sophisticated solution, such as preventing the
parser from treating '++' as a blank token, but Oleg will explain this
much better as soon as he has time.

--
Regards,
Ivan

On 5/16/07, Kimball <kbighorse(at)gmail(dot)com> wrote:
>
> postgres=# select to_tsvector('default','I know how to code in C#, java and
> C++');
> to_tsvector
> -------------------------------------
> 'c':7,10 'code':5 'java':8 'know':2
> (1 row)
>
> postgres=# select to_tsvector('simple','I know how to code in C#, java and
> C++');
> to_tsvector
> -------------------------------------------------------------------------
> 'c':7,10 'i':1 'in':6 'to':4 'and':9 'how':3 'code':5 'java':8 'know':2
> (1 row)
>
>
> I'd like to get lexemes/tokens 'c#' and 'c++' out of this query. Everything
> I can find has to do with stop words. How do I keep characters that
> tsearch throws out? I've already tried 'c\#' and 'c\\#' etc, which don't
> work.
>
> Kimball
