Re: Configuring Text Search parser?

From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: jesper(at)krogh(dot)cc
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Configuring Text Search parser?
Date: 2010-09-21 17:38:32
Message-ID: 1285090712.4454.70.camel@yoffice
Lists: pgsql-hackers

Your changes are fine as far as they go: they will get you tokens that
contain "_" characters. However, it is better not to mix your new token
with an existing token type such as NUMWORD. Give the new token type its
own name, perhaps UnderscoreWord. Then, on seeing "_", move to a state
that can identify the new token, and output the token once it has been
fully recognized.
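
To make the idea concrete, here is a minimal sketch of it as a standalone
toy tokenizer. It is not a patch against wparser_def.c and does not use
the TParserStateActionItem tables. The names TOK_UNDERSCOREWORD,
ST_IN_UNDERSCOREWORD, is_inner_underscore and next_token are made up for
illustration, and the toy treats everything it does not know (including a
leading digit or a trailing "_") as a one-character blank.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types; UNDERSCOREWORD is the new, separate type. */
typedef enum { TOK_EOF, TOK_BLANK, TOK_ASCIIWORD, TOK_UNDERSCOREWORD } TokType;

/* Hypothetical states; the last one is entered only after seeing "_". */
typedef enum { ST_BASE, ST_IN_ASCIIWORD, ST_IN_UNDERSCOREWORD } State;

typedef struct
{
    TokType type;
    size_t  start;   /* offset of the first character of the token */
    size_t  len;     /* token length in bytes */
} Token;

/* True if s[i] is '_' immediately followed by a letter or digit. */
static int
is_inner_underscore(const char *s, size_t i)
{
    return s[i] == '_' && isalnum((unsigned char) s[i + 1]);
}

/* Return the token starting at *pos and advance *pos past it. */
static Token
next_token(const char *s, size_t *pos)
{
    State state = ST_BASE;
    Token tok;

    assert(s != NULL && pos != NULL);
    tok.type = TOK_EOF;
    tok.start = *pos;
    tok.len = 0;

    for (;;)
    {
        unsigned char c = (unsigned char) s[*pos];

        switch (state)
        {
            case ST_BASE:
                if (c == '\0')
                    return tok;              /* end of input */
                (*pos)++;
                if (isalpha(c))
                {
                    state = ST_IN_ASCIIWORD;
                    break;
                }
                tok.type = TOK_BLANK;        /* anything else: 1-char blank */
                tok.len = 1;
                return tok;

            case ST_IN_ASCIIWORD:
                if (isalpha(c))
                {
                    (*pos)++;
                    break;
                }
                if (is_inner_underscore(s, *pos))
                {
                    /* on "_", move to the state for the new token type */
                    state = ST_IN_UNDERSCOREWORD;
                    (*pos)++;
                    break;
                }
                tok.type = TOK_ASCIIWORD;    /* plain word, no underscore */
                tok.len = *pos - tok.start;
                return tok;

            case ST_IN_UNDERSCOREWORD:
                if (isalnum(c) || is_inner_underscore(s, *pos))
                {
                    (*pos)++;
                    break;
                }
                tok.type = TOK_UNDERSCOREWORD;   /* new token recognized */
                tok.len = *pos - tok.start;
                return tok;
        }
    }
}
```

With this toy, "database_tag_number_999" comes out as one
TOK_UNDERSCOREWORD, while plain words still come out as TOK_ASCIIWORD, so
the existing token type keeps its meaning.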

To extract portions of the newly created token, you can write a special
handler for the token that resets the parser position to the start of
the token, so that the token is scanned again for its parts. Then modify
the state machine to output each part-token before it goes into the
state that leads to the full token identified earlier.
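
A rough sketch of that rewind step, again as a self-contained toy and not
as code for wparser_def.c. The Parser struct, special_underscore and
next_token are hypothetical names, and the sketch makes two simplifying
assumptions: the full token is output first and its parts afterwards, and
the "_" separators between parts are skipped rather than output as blanks.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token types for this toy. */
typedef enum { TOK_EOF, TOK_BLANK, TOK_WORD, TOK_UNDERSCOREWORD } TokType;

typedef struct
{
    TokType type;
    size_t  start;
    size_t  len;
} Token;

typedef struct
{
    const char *str;
    size_t      pos;        /* current read position */
    size_t      parts_end;  /* end of the token being re-scanned */
    bool        in_parts;   /* true while outputting parts of a token */
} Parser;

/* "Special handler": rewind to the token start and switch to part mode. */
static void
special_underscore(Parser *p, const Token *whole)
{
    p->parts_end = whole->start + whole->len;
    p->pos = whole->start;
    p->in_parts = true;
}

static Token
next_token(Parser *p)
{
    Token tok;
    bool  saw_underscore = false;

    assert(p != NULL && p->str != NULL);

    if (p->in_parts)
    {
        /* skip the separator that ended the previous part */
        if (p->pos < p->parts_end && p->str[p->pos] == '_')
            p->pos++;
        if (p->pos < p->parts_end)
        {
            /* output the part before the full token can be matched again */
            tok.type = TOK_WORD;
            tok.start = p->pos;
            while (p->pos < p->parts_end && p->str[p->pos] != '_')
                p->pos++;
            tok.len = p->pos - tok.start;
            return tok;
        }
        p->in_parts = false;    /* all parts output; resume normal scanning */
    }

    tok.start = p->pos;
    tok.len = 0;
    if (p->str[p->pos] == '\0')
    {
        tok.type = TOK_EOF;
        return tok;
    }
    if (!isalnum((unsigned char) p->str[p->pos]))
    {
        p->pos++;
        tok.type = TOK_BLANK;
        tok.len = 1;
        return tok;
    }

    /* letters, digits, and "_" that is followed by a letter or digit */
    while (isalnum((unsigned char) p->str[p->pos]) ||
           (p->str[p->pos] == '_' &&
            isalnum((unsigned char) p->str[p->pos + 1])))
    {
        if (p->str[p->pos] == '_')
            saw_underscore = true;
        p->pos++;
    }
    tok.len = p->pos - tok.start;
    tok.type = saw_underscore ? TOK_UNDERSCOREWORD : TOK_WORD;
    if (saw_underscore)
        special_underscore(p, &tok);    /* rewind so the parts follow */
    return tok;
}
```

For the input "a_b c" this toy outputs the full token "a_b", then the
parts "a" and "b", then the blank and the word "c".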

Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00004.php

-Sushant.

On Mon, 2010-09-20 at 16:01 +0200, jesper(at)krogh(dot)cc wrote:
> Hi.
>
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL .. one of my main (remaining) headaches is the
> fact that PostgreSQL treats _ as a separator character whereas the existing
> behaviour is to "not split". That means:
>
> testdb=# select ts_debug('database_tag_number_999');
> ts_debug
> ------------------------------------------------------------------------------
> (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
> (blank,"Space symbols",_,{},,)
> (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
> (blank,"Space symbols",_,{},,)
> (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
> (blank,"Space symbols",_,{},,)
> (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
>
> Here the incoming data, by design, contains a set of tags that include _
> and are each expected to be one "lexeme".
>
> I've tried to work around this with the following patch.
>
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
> --- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
> ***************
> *** 967,986 ****
> --- 967,988 ----
>
> static const TParserStateActionItem actionTPS_InNumWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
> {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
> {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
> {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
> };
>
> static const TParserStateActionItem actionTPS_InAsciiWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
> {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
> {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***************
> *** 995,1004 ****
> --- 997,1007 ----
>
> static const TParserStateActionItem actionTPS_InWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
> {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
> {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
> {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
> {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
> };
>
>
>
> This will obviously break other people's applications, so my question is:
> if this should be made configurable, how should it be done?
>
> As a sidenote... Xapian doesn't split on _ .. Lucene does.
>
> Thanks.
>
> --
> Jesper
>
>
