Configuring Text Search parser?

From: jesper(at)krogh(dot)cc
To: pgsql-hackers(at)postgresql(dot)org
Subject: Configuring Text Search parser?
Date: 2010-09-20 14:01:08
Message-ID: 1a26550c0b55c0a0af0dcbd8e080bc82.squirrel@shrek.krogh.cc
Lists: pgsql-hackers

Hi.

I'm trying to migrate an application off an existing Full Text Search engine
and onto PostgreSQL. One of my main (remaining) headaches is the
fact that PostgreSQL treats _ as a separator character, whereas the existing
behaviour is to "not split". That means:

testdb=# select ts_debug('database_tag_number_999');
ts_debug
------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
(blank,"Space symbols",_,{},,)
(asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
(blank,"Space symbols",_,{},,)
(asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
(blank,"Space symbols",_,{},,)
(uint,"Unsigned integer",999,{simple},simple,{999})
(7 rows)

The incoming data, by design, contains a set of tags which include _
and are each expected to be one "lexeme".
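
To make the difference concrete, here is a minimal standalone sketch in C. It
is not PostgreSQL code: is_token_char() and count_tokens() are hypothetical
helpers made up for illustration, and unlike the real parser they do not
distinguish word tokens from integer tokens. It only counts maximal runs of
token characters, with and without _ treated as part of a token.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of PostgreSQL: does c belong to a token? */
static bool
is_token_char(unsigned char c, bool keep_underscore)
{
    return isalnum(c) || (keep_underscore && c == '_');
}

/*
 * Count maximal runs of token characters in s.
 * keep_underscore = false mimics the "split on _" behaviour,
 * keep_underscore = true mimics the "do not split" behaviour.
 */
static size_t
count_tokens(const char *s, bool keep_underscore)
{
    size_t n = 0;
    bool   in_token = false;

    for (; *s != '\0'; s++)
    {
        bool tok = is_token_char((unsigned char) *s, keep_underscore);

        if (tok && !in_token)
            n++;                /* a new token starts here */
        in_token = tok;
    }
    return n;
}
```

With this sketch, count_tokens("database_tag_number_999", false) counts the
four pieces that ts_debug reports above (ignoring the blanks), while passing
true counts the whole tag as a single token, which is what the application
expects.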

I've tried patching my way out of this with the following patch:

$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
--- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----

static const TParserStateActionItem actionTPS_InNumWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
{NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
};

static const TParserStateActionItem actionTPS_InAsciiWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----

static const TParserStateActionItem actionTPS_InWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
{p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
{p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
{p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
{NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
};

This will obviously break other people's applications, so my question is:
if this should be made configurable, how should it be done?

As a side note: Xapian doesn't split on _, but Lucene does.

Thanks.

--
Jesper
