Re: WIP: lookbehind constraints for our regexp engine

From: Thom Brown <thombrown(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: lookbehind constraints for our regexp engine
Date: 2015-10-17 08:19:26
Message-ID: CAA-aLv5Etwt9t1yQ7AJDewfkht1_+_hyCBxbWrZ--B5AkuuU6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Oct 17, 2015 12:53 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I've occasionally heard complaints that our regex engine only has
> lookahead constraints not lookbehind constraints, while Perl's for example
> does have those. While I was fooling around with the other issues in that
> code, I learned enough to realize that it would not be that hard to
> implement such things, and the attached proof-of-concept patch does so
> (using Perl-compatible syntax, in case you're wondering).
>
> However, I don't feel like this is done to the point of being committable,
> because its performance leaves something to be desired in a lot of cases.
> The difficulty is that because the engine can only match in a forward
> direction, in the worst case you have to check a lookbehind constraint by
> trying to match starting from the string start to see if a match exists
> ending at the current-location where you're testing the constraint. That
> makes it, worst case, O(N^2) to search a string of length N. Now, you can
> improve on that greatly if you can determine that possible matches don't
> extend all the way back to the string start. In the attached patch I use
> the "cold start pointer" returned by shortest() to decide that characters
> before the coldstart point no longer have to be rechecked. That helps
> some, but it's not good enough because there are too many cases where the
> engine doesn't move the coldstart point forward very aggressively.
>
> It might be that that behavior itself can be improved on, which would
> be nice because there are also non-lookbehind-related scenarios where
> smarter coldstart detection would help performance. But I have no very
> good ideas about how to do it.
>
> Another idea I've been toying with is to add logic that tries to determine
> the maximum possible match length for a regex. If we can determine that
> for the lookbehind sub-regex, then we'd know we have to back up at most
> that far before applying the match check. This seems promising because
> a lot of other engines don't even support variable-length lookbehind
> patterns at all, and so we'd be assured of good performance in all the
> cases that competitors can handle at all.
>
> Anyway, I'm not planning to do much more work on this right now, but
> I thought I'd throw it out there just to let people know that this seems
> within reach. I'm curious to know how many people care, and how much.

+1

I'm one of those who wanted lookbehind, and it would give us complete
lookaround so very much welcome.

Thom

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2015-10-17 10:04:43 Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618”: Permission denied”
Previous Message Amit Kapila 2015-10-17 06:44:20 Re: Parallel Seq Scan