Re: old bug in full text parser

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: old bug in full text parser
Date: 2016-02-10 10:04:07
Message-ID: CAF4Au4xrkE5yHbNDBg+0Cn0VLKm9c+SD13No0yUix483_F2bvw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:

> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its
> parts, and to ignore the hyphen. That is how tsearch2 works.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain text words like 'four-dot': the
> hyphenated word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
> Change text search parsing rules for hyphenated words so that digit strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent part-by-part
> reparsing states so that we don't get different answers about how much text
> is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>

Oh, one more bug, which existed even in tsearch2: a hyphenated word that
starts with a number is not reported as a hyphenated word at all (8.2.23):

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)
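
To check all three cases side by side on a given server, something like the
following could be used. This is only a sketch: it assumes PostgreSQL 9.3 or
later (for LATERAL) and the built-in 'simple' configuration with the default
parser, and the column name "sample" is just for illustration. The output is
not shown because it depends on the server version.

```sql
-- Run the default parser over each sample string and list every token it
-- reports. Rows with alias 'blank' are kept on purpose, so that a reported
-- hyphen stays visible in the result.
SELECT t.sample, d.alias, d.description, d.token
FROM (VALUES ('dot-four'), ('dot-4'), ('4-dot')) AS t(sample),
     LATERAL ts_debug('simple', t.sample) AS d;
```

With the originally intended behaviour, each sample would show one row for
the whole hyphenated word plus one row per part, and no row for the hyphen.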

>
> Regards,
> Oleg
>
