Re: regexp_positions()

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "David Fetter" <david(at)fetter(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: regexp_positions()
Date: 2021-02-28 03:58:05
Message-ID: 4d4a4d83-efae-414b-878d-46e1b9d64738@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sun, Feb 28, 2021, at 03:13, David Fetter wrote:
> Maybe an int4multirange, which would fit unless I'm misunderstanding
> g's meaning with respect to non-overlapping patterns, but that might
> be a little too cute and not easy ever to extend.
>
> Come to that, would a row structure that looked like
>
> (match, start, end)
>
> be useful?

Nice, didn't know about the new multirange.
Its data structure seems like a perfect fit for this.

Hmm, I cannot find any catalog function to extract the ranges from the data structure though?
As a caller, I might need the exact start/end values,
not just wanting to know if a certain values overlaps any of the ranges.
Is there such a function?

Here is a PoC that just returns the start_pos and end_pos for all the matches.
It would be simple to modify it to instead return multirange.

CREATE OR REPLACE FUNCTION regexp_ranges(string text, pattern text, OUT start_pos integer, OUT end_pos integer)
RETURNS SETOF RECORD
LANGUAGE plpgsql
AS
$$
DECLARE
match text;
remainder text := string;
len integer;
BEGIN
end_pos := 0;
--
-- Ignore possible capture groups, instead just wrap the entire regex
-- in an enclosing capture group, which is then extracted as the first array element.
--
FOR match IN SELECT (regexp_matches(string,format('(%s)',pattern),'g'))[1] LOOP
len := length(match);
start_pos := position(match in remainder) + end_pos;
end_pos := start_pos + len - 1;
RETURN NEXT;
remainder := right(remainder, -len);
END LOOP;
RETURN;
END
$$;

This works fine for small strings:

SELECT * FROM regexp_ranges('aaaa aa aaa','a+');
start_pos | end_pos
-----------+---------
1 | 4
6 | 7
10 | 12
(3 rows)

Time: 0.209 ms

But quickly gets slow for longer strings:

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',10000),'a+');
20001
Time: 98.663 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',20000),'a+');
40001
Time: 348.027 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',30000),'a+');
60001
Time: 817.412 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',40000),'a+');
80001
Time: 1478.438 ms (00:01.478)

Compared to the much nicer observed O-notation for regexp_matches():

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',10000),'(a+)','g');
20001
Time: 12.602 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',20000),'(a+)','g');
40001
Time: 25.161 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',30000),'(a+)','g');
60001
Time: 44.795 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',40000),'(a+)','g');
80001
Time: 57.292 ms

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2021-02-28 04:18:21 Re: [HACKERS] Custom compression methods
Previous Message David Fetter 2021-02-28 02:13:48 Re: regexp_positions()