| From: | Brian DeRocher <brian(at)derocher(dot)org> | 
|---|---|
| To: | pgsql-general(at)postgresql(dot)org | 
| Subject: | to_tsvector() with hyphens | 
| Date: | 2015-07-06 16:30:27 | 
| Message-ID: | 2437094.FICmWb5XyZ@bregalad | 
| Lists: | pgsql-general | 
Hey everyone,
I think it's great that the full text search parser breaks hyphenated words into multiple parts. I think this really could help, but something is not right.
rasmas_hackathon=> select * from ts_debug( 'gn-foo' );
      alias      |           description           |  token  |  dictionaries  |  dictionary  | lexemes  
-----------------+---------------------------------+---------+----------------+--------------+----------
 asciihword      | Hyphenated word, all ASCII      | gn-foo  | {english_stem} | english_stem | {gn-foo}
 hword_asciipart | Hyphenated word part, all ASCII | gn      | {english_stem} | english_stem | {gn}
 blank           | Space symbols                   | -       | {}             |              | 
 hword_asciipart | Hyphenated word part, all ASCII | foo     | {english_stem} | english_stem | {foo}
 blank           | Space symbols                   |         | {}             |              | 
(6 rows)
But why does to_tsquery() AND them?
rasmas_hackathon=> select * from to_tsquery( 'gn-foo | bandage' );
             to_tsquery             
------------------------------------
 'gn-foo' & 'gn' & 'foo' | 'bandag'
(1 row)
Perhaps my vector is like this:
rasmas_hackathon=> select to_tsvector( 'gn series bandage' );
         to_tsvector         
-----------------------------
 'bandag':3 'gn':1 'seri':2
(1 row)
The resulting rank is poor:
rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn-foo | bandage' ) );
 ts_rank_cd 
------------
        0.1
(1 row)
Without the hyphen the rank is better, despite the process above.
rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn | bandage' ) );
 ts_rank_cd 
------------
        0.2
(1 row)
So wouldn't this be a better query for hyphenated words?
'gn-foo' | 'gn' | 'foo'
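To try the OR form by hand, one option is to cast a tsquery literal directly, since a plain cast skips the parser and so avoids the automatic AND expansion. This is only a sketch: it assumes the same default english configuration as the examples above, and because the cast does no stemming the lexemes have to be written already normalized.

```sql
-- Sketch only: the literal is cast straight to tsquery, so it is not run
-- through the parser or the english_stem dictionary. Lexemes are therefore
-- written in their stemmed form ('bandag', not 'bandage').
select ts_rank_cd(
         to_tsvector( 'gn series bandage' ),
         $$'gn-foo' | 'gn' | 'foo' | 'bandag'$$::tsquery
       );
```

Comparing this rank with the two results above should show whether the AND between the hyphenated parts is what lowers the score.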
Aside: As best I can tell, the parser is instructing pushval_morph() to treat the parts of a hyphenated word as "same variants".
thanks,
Brian
-- 
http://brian.derocher.org
http://mappingdc.org
http://about.me/brian.derocher