Re: old bug in full text parser

From: Mike Rylander <mrylander(at)gmail(dot)com>
To: obartunov(at)gmail(dot)com
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: old bug in full text parser
Date: 2016-02-10 16:45:47
Message-ID: CAO8ar==RC4o7a3Yw_AoQ=TVyH2EmZLx1PRQPGfios+XsXEr+xw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its parts,
> and to ignore the hyphen. That is how tsearch2 worked.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain words like 'four-dot': the hyphenated
> word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>

The Evergreen project has long depended on tsearch2 (both as an
extension and as in-core FTS), and one thing we've struggled with is
parsing date ranges, such as birth and death years for authors in the
form 1979-2014. Strings like that end up being parsed as two
lexemes, "1979" and "-2014". We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step, it would be a great help to us. Would such
strings be parsed "properly" into lexemes of the form "1979" and
"2014" with your proposed change?
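
For readers unfamiliar with the workaround, the pre-normalization step described above might look roughly like the following Python sketch. Only the pattern comes from this message; the function name normalize_year_ranges and the sample string are hypothetical, and this is not Evergreen's actual code.

```python
import re

# Pattern taken from the message: a run of digits, a hyphen, a run of digits.
YEAR_RANGE = re.compile(r"(\d+)-(\d+)")


def normalize_year_ranges(text: str) -> str:
    """Replace the hyphen in digit-hyphen-digit runs with a single space.

    Hypothetical helper: it mirrors the /(\\d+)-(\\d+)/ rewrite described
    in the message so that the parser never sees "-2014" as a signed integer.
    """
    return YEAR_RANGE.sub(r"\1 \2", text)


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    print(normalize_year_ranges("Doe, Jane, 1979-2014"))
```

Strings without a digit on both sides of the hyphen, such as 'dot-4' or '4-dot', pass through unchanged, so the sketch only touches the numeric-range case.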

Thanks!

--
Mike Rylander

> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
>     Change text search parsing rules for hyphenated words so that digit strings
>     containing decimal points aren't considered part of a hyphenated word.
>     Sync the hyphenated-word lookahead states with the subsequent part-by-part
>     reparsing states so that we don't get different answers about how much text
>     is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
> Regards,
> Oleg
