Re: Our regex vs. POSIX on "longest match"

From: Brendan Jurd <direvus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Our regex vs. POSIX on "longest match"
Date: 2012-03-04 11:55:33
Message-ID: CADxJZo3qSSOwT5aMbzLtp+uP6+HzoDjjvxYL_L2MvOKnt1oiWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4 March 2012 17:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Brendan Jurd <direvus(at)gmail(dot)com> writes:
>> I'll admit that this is a pretty obscure point, but we do appear to be
>> in direct violation of POSIX here.
>
> How so?  POSIX doesn't contain any non-greedy constructs.  If you use
> only the POSIX-compatible greedy constructs, the behavior is compliant.

While it's true that POSIX doesn't contemplate non-greed, after
reading the spec I would have expected an expression *as a whole* to
still prefer the longest match, insofar as that is possible while
respecting non-greedy particles. I just ran some quick experiments in
Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned
'y1234', which to my mind is the most obvious answer, and one which
still makes sense in light of what POSIX has to say. Whereas Postgres
(and presumably Tcl) return 'y1'.

What I found surprising here is that our implementation allows an
earlier quantifier to clobber the greediness of a later quantifier.
There's a certain disregard for the intentions of the author of the
regex. If I had wanted the whole expression to be non-greedy, I could
have written 'y*?\d+?'. But since I wrote 'y*?\d+', it is clear that
I meant for the first atom to be non-greedy, and for the second to be
greedy.

> The issue that is obscure is, once you define some non-greedy
> constructs, how to define how they should act in combination with greedy
> ones.  I'm not sure to what extent the engine's behavior is driven by
> implementation restrictions and to what extent it's really the sanest
> behavior Henry could think of.  I found a comment from him about it:
> http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50
> but it's short on details as to what alternatives he considered.

Thanks for the link; it is always good to get more insight into
Henry's approach. I'm beginning to think I should just start reading
everything he ever posted to comp.lang.tcl ...

Cheers,
BJ

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2012-03-04 13:02:57 Re: RFC: Making TRUNCATE more "MVCC-safe"
Previous Message Magnus Hagander 2012-03-04 11:26:56 Re: xlog location arithmetic