Re: Regex pattern with shorter back reference does NOT work as expected

From: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Regex pattern with shorter back reference does NOT work as expected
Date: 2013-07-15 06:59:28
Message-ID: CAM2+6=UHVX=5PKYMVFy=2gTKqWdQDB0Jd-BnexG9LreTNW-gTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

On Sat, Jul 13, 2013 at 10:43 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I wrote:
> > Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com> writes:
> >> Following example does not work as expected:
> >>
> >> -- Should return TRUE but returning FALSE
> >> SELECT 'Programmer' ~ '(\w).*?\1' as t;
>
> > This is clearly broken, but I'm uncomfortable with the proposed patch.
> > As written, it changes behavior for both the shortest-match-preferred
> > and longest-match-preferred cases; but you've offered no evidence that
> > the longest-match case is broken. Maybe it is --- it's sure not
> > obvious why it's okay to abandon the search early in this case. But I
> > think we'd have been likely to hear about it before now if there were
> > a matching failure in that path, since longest-preferred is so much
> > more widely used.
>
> After reflecting on this for awhile, I think that the longest-preferred
> case is indeed also wrong in theory, but it's unreachable, which
> explains the lack of bug reports. To get to the "no point in trying
> again" code, we have to have a success of the DFA match followed by a
> failure of the cdissect match, which essentially means that the string
> would match if we didn't have to constrain some backref to exactly match
> the string matched by its referent. Now, in the longest-preferred case,
> the supposed early exit test is "end == begin", ie the tentative match
> was of zero length overall. I can't envision any situation in which a
> backref constraint could fail given that, because both the referent
> pattern piece and the backref piece would have to be matching
> zero-length substrings as well. There could be anchor constraints,
> lookahead constraints, and so forth in play, but those would all have
> been checked by the DFA, so they're not going to result in cdissect
> failures. Hence, the "end == begin" test will simply never succeed.
>

Thanks for the explanation.
For last couple of days I was trying hard to find a test-case which triggers
"end == begin" but didn't find one.
This explanation proves that it will never reach that. So I give up now.

>
> On the other hand, the check made in the shortest-preferred case is
> going to succeed if the tentative match was of maximal not minimal
> length, and that's certainly a possible case.
>
> So I think your patch is right, although I'd be inclined to refactor
> the code to have just one test on "shorter", like this:
>
> /* go around and try again, if possible */
> if (shorter)
> {
> if (end == estop)
> break;
> estart = end + 1;
> }
> else
> {
> if (end == begin)
> break;
> estop = end - 1;
> }
>
> so as to make it clearer that we're just defending against selecting an
> impossible new estart or estop location. (Although I argued above that
> the "end == begin" test can't succeed, I wouldn't care to remove it
> entirely here.)
>

OK. Looks good to me.

Thanks

>
> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2013-07-15 07:58:49 Re: Proposal - Support for National Characters functionality
Previous Message Peter Geoghegan 2013-07-15 06:36:58 Re: Proposal - Support for National Characters functionality