Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Gilles Darold" <gilles(at)darold(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mark Dilger" <mark(dot)dilger(at)enterprisedb(dot)com>, "Postgres hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Andreas Karlsson" <andreas(at)proxel(dot)se>, "David Fetter" <david(at)fetter(dot)org>
Subject: Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]
Date: 2021-03-05 10:37:35
Message-ID: 3eaad17f-2988-4e65-903c-b784d30dfe8e@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 4, 2021, at 17:55, Gilles Darold wrote:
> I also think that it should return a setof 2-D integer array, an other
> solution is to return all start/end positions of an occurrence chained
> in an integer array {start1,end1,start2,end2,..}.

Hmm. Seems like we've in total managed to come up with three flawed ideas.

Pros/cons I see:

Idea #1: setof 2-D integer array
+ Packs the values into one single value.
- Difficult to work with 2-D arrays, doesn't work well with unnest(), has to inspect the dims and use for loops to extract values.
- Looking at a 2-D value, it's not obvious what the integer values means in it means. Which one is "startpos" and do we have "length" or "endpos" values?

Idea #2: setof (start_pos integer[], end_pos integer[])
+ It's obvious to the user what type of integers "start_pos" and "end_pos" contain.
- Decouples the values into two separate values.
- Tom mentioned some bad experiences with separate array values. (Details on this would be interesting.)

Idea #3: chained integer array {start1,end1,start2,end2,..}
- Mixes different values into the same value
- Requires maths (although simple calculations) to extract values

I think all three ideas (including mine) are ugly. None of them is wart free.

Idea #4: add a new composite built-in type.

A simple composite type with two int8 fields.

The field names seems to vary a lot between languages:

Rust: "start", "end" [1]
C++: "begin", "end" [2]
Python: "start", "stop" [3]

Such a simple composite type, could then always be used,
when you want to represent simple integer ranges,
between two exact values, arguably a very common need.

Such type could be converted to/from int8range,
but would have easily accessible field names,
which is simpler than using lower() and upper(),
since upper() always returns the canonical
exclusive upper bound for discrete types,
which is not usually what you want when
dealing with "start" and "end" integer ranges.

Since there is no type named just "range", why not just use this name?

Since "end" is a keyword, I suggest the "stop" name:

PoC:

CREATE TYPE range AS (start int8, stop int8);

A real implementation would of course also verify CHECK (start <= stop),
and would add conversions to/from int8range.

I realise this is probably a controversial idea.
But, I think this is a general common problem that deserves a clean general solution.

Thoughts? More ideas?

[1] https://doc.rust-lang.org/std/ops/struct.Range.html
[2] https://en.cppreference.com/w/cpp/ranges
[3] https://www.w3schools.com/python/ref_func_range.asp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiro Ikeda 2021-03-05 10:54:23 Re: About to add WAL write/fsync statistics to pg_stat_wal view
Previous Message Amit Kapila 2021-03-05 10:35:17 Re: Parallel INSERT (INTO ... SELECT ...)