Re: old bug in full text parser

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Mike Rylander <mrylander(at)gmail(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: old bug in full text parser
Date: 2016-02-10 20:59:29
Message-ID: CAF4Au4ybGJMErZf+CRDX0Y=SRuLhGA0pi8nThjrs2-DhfJo0xQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 7:45 PM, Mike Rylander <mrylander(at)gmail(dot)com> wrote:

> On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
> > It looks like there is a very old bug in the full text parser (somebody
> > pointed it out to me), which appeared after tsearch2 was moved into core.
> > The problem is in how the parser processes hyphenated words. Our original
> > idea was to report the hyphenated word itself as well as its parts, and to
> > ignore the hyphen. That is how tsearch2 worked.
> >
> > This behaviour changed after tsearch2 was moved into core:
> > 1. The hyphen is now reported by the parser, which is useless.
> > 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
> > differently from ones made of plain text words like 'four-dot': the
> > hyphenated word itself is not reported.
> >
> > I think we should consider this a bug and produce a fix for all supported
> > versions.
> >
>
> The Evergreen project has long depended on tsearch2 (both as an
> extension and in-core FTS), and one thing we've struggled with is date
> range parsing such as birth and death years for authors in the form of
> 1979-2014, for instance. Strings like that end up being parsed as two
> lexemes, "1979" and "-2014". We work around this by pre-normalizing
> strings matching /(\d+)-(\d+)/ into two numbers separated by a space
> instead of a hyphen, but if fixing this bug would remove the need for
> such a preprocessing step it would be a great help to us. Would such
> strings be parsed "properly" into lexemes of the form "1979" and
> "2014" with your proposed change?
>
>
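
For reference, the pre-normalization workaround described above could look roughly like the following. This is a minimal sketch in Python, not Evergreen's actual code: the function name `prenormalize_ranges` is hypothetical, and only the pattern /(\d+)-(\d+)/ comes from the message.

```python
import re

# Two digit runs joined by a single hyphen, e.g. "1979-2014".
# The pattern is the one quoted in the message; everything else is assumed.
_RANGE = re.compile(r"(\d+)-(\d+)")


def prenormalize_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a space."""
    return _RANGE.sub(r"\1 \2", text)


print(prenormalize_ranges("Doe, Jane, 1979-2014"))  # Doe, Jane, 1979 2014
```

Text without a digit-hyphen-digit sequence, such as 'dot-four', passes through unchanged, so the hyphenated-word handling of the parser is not affected by this step.
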
I'd love to treat all hyphenated "words" the same way, regardless of whether
each part is "a word", a number, or plain text; namely, 'w1-w2' should be
reported as {'w1-w2', 'w1', 'w2'}. The problem is in the definition of "word".
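
To make the intended result concrete, here is a small Python sketch of the rule 'w1-w2' -> {'w1-w2', 'w1', 'w2'}. It is only an illustration, not the parser's implementation: the function name `hyphenated_tokens` is hypothetical, and it assumes that any non-empty run of characters between hyphens counts as a "word", which is exactly the definition that is still open.

```python
from typing import List


def hyphenated_tokens(word: str) -> List[str]:
    """Return the compound itself followed by its parts, dropping hyphens.

    Assumption: every non-empty chunk between hyphens is a part, whether it
    is made of letters or digits.
    """
    parts = [p for p in word.split("-") if p]
    if len(parts) < 2:
        # Not a compound: report only the single part, if any.
        return parts
    return [word] + parts


for sample in ("dot-four", "dot-4", "4-dot"):
    print(sample, "->", hyphenated_tokens(sample))
```

Under this rule 'dot-four', 'dot-4' and '4-dot' all yield three tokens, and the hyphen itself is never reported.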

We'll definitely look at the parser again. Fortunately, we could just fork the
default parser and develop a new one, so as not to break compatibility. You
have a chance to help us produce a "consistent" view of which tokens the new
parser should recognize and how to process them.

> Thanks!
>
> --
> Mike Rylander
>
> > After investigation we found this commit:
> >
> > commit 73e6f9d3b61995525785b2f4490b465fe860196b
> > Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> > Date: Sat Oct 27 19:03:45 2007 +0000
> >
> > Change text search parsing rules for hyphenated words so that digit strings
> > containing decimal points aren't considered part of a hyphenated word.
> > Sync the hyphenated-word lookahead states with the subsequent part-by-part
> > reparsing states so that we don't get different answers about how much text
> > is part of the hyphenated word. Per my gripe of a few days ago.
> >
> >
> > 8.2.23
> >
> > select tok_type, description, token from ts_debug('dot-four');
> >   tok_type   |          description          |  token
> > -------------+-------------------------------+----------
> >  lhword      | Latin hyphenated word         | dot-four
> >  lpart_hword | Latin part of hyphenated word | dot
> >  lpart_hword | Latin part of hyphenated word | four
> > (3 rows)
> >
> > select tok_type, description, token from ts_debug('dot-4');
> >   tok_type   |          description          | token
> > -------------+-------------------------------+-------
> >  hword       | Hyphenated word               | dot-4
> >  lpart_hword | Latin part of hyphenated word | dot
> >  uint        | Unsigned integer              | 4
> > (3 rows)
> >
> > select tok_type, description, token from ts_debug('4-dot');
> >  tok_type |   description    | token
> > ----------+------------------+-------
> >  uint     | Unsigned integer | 4
> >  lword    | Latin word       | dot
> > (2 rows)
> >
> > 8.3.23
> >
> > select alias, description, token from ts_debug('dot-four');
> >       alias      |           description           |  token
> > -----------------+---------------------------------+----------
> >  asciihword      | Hyphenated word, all ASCII      | dot-four
> >  hword_asciipart | Hyphenated word part, all ASCII | dot
> >  blank           | Space symbols                   | -
> >  hword_asciipart | Hyphenated word part, all ASCII | four
> > (4 rows)
> >
> > select alias, description, token from ts_debug('dot-4');
> >    alias   |   description   | token
> > -----------+-----------------+-------
> >  asciiword | Word, all ASCII | dot
> >  int       | Signed integer  | -4
> > (2 rows)
> >
> > select alias, description, token from ts_debug('4-dot');
> >    alias   |   description    | token
> > -----------+------------------+-------
> >  uint      | Unsigned integer | 4
> >  blank     | Space symbols    | -
> >  asciiword | Word, all ASCII  | dot
> > (3 rows)
> >
> >
> > Regards,
> > Oleg
>
