From: | Mason Hale <masonhale(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Help with tokenization of age-ranges in full text search |
Date: | 2015-02-26 03:05:57 |
Message-ID: | CAAFdBmQzXMOUm2b_JgmhanRiJPOG7oTBvrfRYowWhQ2XqGTHFA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello, I've got a 9.3 database hosted at Heroku.
I'm full text search to search for "group names" in part of my application,
and some of my group names are the names of youth sports age groups like
"Boys 9-10" or "Girls 11-12".
I would like for a search for the terms "Boys", "Boys 9-10", "9", "10" or
"9-10" to match "Boys 9-10".
I have created a custom dictionary and search configuration as follows:
CREATE TEXT SEARCH DICTIONARY public.simple_nostem_no_stop (
TEMPLATE = pg_catalog.simple
);
CREATE TEXT SEARCH CONFIGURATION public.user_search ( COPY =
pg_catalog.simple );
ALTER TEXT SEARCH CONFIGURATION public.user_search
ALTER MAPPING FOR email, asciiword, asciihword, hword_asciipart,
hword, hword_part, word WITH simple_nostem_no_stop;
Which results in this configuration:
development=# \dF+ public.user_search
Text search
configuration "public.user_search"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+-----------------------
asciihword | simple_nostem_no_stop
asciiword | simple_nostem_no_stop
blank | simple
email | simple_nostem_no_stop
file | simple
float | simple
host | simple
hword | simple_nostem_no_stop
hword_asciipart | simple_nostem_no_stop
hword_numpart | simple_nostem_no_stop
hword_part | simple_nostem_no_stop
int | simple
numhword | simple_nostem_no_stop
numword | simple_nostem_no_stop
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | simple_nostem_no_stop
Testing my query, I get the following tokenization:
development=# select alias, token from ts_debug('public.user_search', 'Boys
9-10');
alias | token
-----------+-------
asciiword | Boys
blank |
uint | 9
int | -10
(4 rows)
I was expecting / hoping to seek tokens for "9-10" and "10".
With the above a search for "9-10" or "10" would not match the term "Boys
9-10".
I was expecting the hnumword or hword_numpart token_types to match, but
they appear to require a leading alpha character to match.
If I add a leading alpha character, it tokenizes the way I would like:
development=# select alias, token from ts_debug('public.user_search', 'Boys
x9-y10');
alias | token
---------------+--------
asciiword | Boys
blank |
numhword | x9-y10
hword_numpart | x9
blank | -
hword_numpart | y10
(6 rows)
So my question is -- can I get the tokenization that I want out of a
configuration of the stock available token types?
Since I'm hosting my db on Heroku, I assume a custom parser extension is
not a possibility.
I've tried remove the uint and int mappings, hoping then that the other
parsers will pick up the slack, but that just results in the terms being
dropped altogether.
For example, if I run:
ALTER TEXT SEARCH CONFIGURATION public.user_search
DROP MAPPING IF EXISTS FOR file, float, host, int, uint, sfloat,
url_path, version, blank, url;
Then my configuration now looks like this:
development=# \dF+ public.user_search
Text search configuration "public.user_search"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+-----------------------
asciihword | simple_nostem_no_stop
asciiword | simple_nostem_no_stop
email | simple_nostem_no_stop
hword | simple_nostem_no_stop
hword_asciipart | simple_nostem_no_stop
hword_numpart | simple_nostem_no_stop
hword_part | simple_nostem_no_stop
numhword | simple_nostem_no_stop
numword | simple_nostem_no_stop
word | simple_nostem_no_stop
But now the tokens are just dropped altogether:
development=# select alias, token, lexemes from
ts_debug('public.user_search', 'Boys 9-10');
alias | token | lexemes
-----------+-------+---------
asciiword | Boys | {boys}
blank | |
uint | 9 |
int | -10 |
(4 rows)
Thanks in advance for any advice.
Would love to find a simple solution.
Mason
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2015-02-26 04:36:32 | Re: Help with tokenization of age-ranges in full text search |
Previous Message | Emanuel Araújo | 2015-02-26 02:14:34 | Triggers Operations |