Re: old bug in full text parser

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: old bug in full text parser
Date: 2016-02-10 21:27:28
Message-ID: CAF4Au4wDNtwMqxK5S602eSxfmdXMz31SjeWM+g1gmzOZVCzcxA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 7:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Oleg Bartunov <obartunov(at)gmail(dot)com> writes:
> > It looks like there is a very old bug in full text parser (somebody
> > pointed me on it), which appeared after moving tsearch2 into the core.
> > The problem is in how full text parser process hyphenated words. Our
> > original idea was to report hyphenated word itself as well as its parts
> > and ignore hyphen. That was how tsearch2 works.
>
> > This behaviour was changed after moving tsearch2 into the core:
> > 1. hyphen now reported by parser, which is useless.
> > 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> > than ones with plain text words like 'four-dot', no hyphenated word
> > itself reported.
>
> > I think we should consider this as a bug and produce fix for all
> > supported versions.
>
> I don't see anything here that looks like a bug, more like a definition
> disagreement. As such, I'd be pretty dubious about back-patching a
> change. But it's hard to debate the merits when you haven't said exactly
> what you'd do instead.
>

Yeah, it is better to call it an inconsistency rather than a bug. We
definitely should work on a better, more consistent parser with
predictable behaviour.

>
> I believe the commit you mention was intended to fix this inconsistency:
>
> http://www.postgresql.org/message-id/6269.1193184058@sss.pgh.pa.us
>
> so I would be against simply reverting it. In any case, the examples
> given there make it look like there was already inconsistency about mixed
> words and numbers. Do we really think that "4-dot" should be considered
> a hyphenated word? I'm not sure.
>

I agree that we shouldn't just revert it. My idea is to work on a new
parser and leave the old one as is for compatibility reasons. Fortunately,
FTS is flexible enough that we could add a new parser at any time as an
extension.
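
To make the original tsearch2 idea concrete (report the hyphenated word
itself as well as its parts, and ignore the hyphen), here is a rough Python
sketch. It is only an illustration under my own assumptions: the names
WORD_RE and tokenize are hypothetical, it treats digits and letters alike,
and it is not the PostgreSQL parser or a proposal for its implementation.

```python
import re

# Hypothetical illustration only; this is NOT the PostgreSQL parser.
# A "word" here is a run of ASCII letters/digits, optionally joined by hyphens.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Return tokens in order of appearance.

    For a hyphenated word, emit the whole word first and then each part.
    The hyphen itself is never emitted as a token.
    """
    tokens = []
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        parts = word.split("-")
        if len(parts) > 1:
            tokens.append(word)  # the hyphenated word itself
        tokens.extend(parts)     # its parts (or the plain word)
    return tokens


if __name__ == "__main__":
    for sample in ("four-dot", "4-dot", "dot-4"):
        print(sample, "->", tokenize(sample))
```

In this sketch '4-dot', 'dot-4' and 'four-dot' all go through the same code
path, which is the kind of consistency I would like a new parser to have.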

>
> regards, tom lane
>
