Re: Define jsonpath functions as stable

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Define jsonpath functions as stable
Date: 2019-09-16 22:39:40
Message-ID: 740d5ed3-b639-dfcd-a905-ca8f5657c041@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 9/16/19 5:10 PM, Tom Lane wrote:
> "Jonathan S. Katz" <jkatz(at)postgresql(dot)org> writes:
>> On 9/16/19 11:20 AM, Tom Lane wrote:
>>> I think we could possibly get away with not having any special marker
>>> on regexes, but just explaining in the documentation that "features
>>> so-and-so are not implemented". Writing that text would require closer
>>> analysis than I've seen in this thread as to exactly what the differences
>>> are.
>
>> +1, and likely would need some example strings too that highlight the
>> difference in how they are processed.
>
> I spent an hour digging through these specs.

Thanks! That sounds like quite the endeavor...

> I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX. However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.
> So it looks like we end up in the same place as POSIX for this.
>
> Otherwise, the pattern language differences I could find are all details
> of character class expressions (bracket expressions, such as "[a-z0-9]")
> and escapes that are character class shorthands:
>
> * We don't have "character class subtraction". I'd be pretty hesitant
> to add that to our regexp language because it seems to change "-" into
> a metacharacter, which would break an awful lot of regexps. I might
> be misunderstanding their syntax for it, because elsewhere that spec
> explicitly claims that "-" is not a metacharacter.

Using something I could understand[1] it looks like the syntax is like:

[a-z-[aeiou]

e.g. all the consonants of the alphabet. I don't believe that would
break many, if any, regexps. I also don't know what kind of effort it
would take to add that in given I had not looked at the regexp code
until today (and only at some of the amusing comments in the header
file, which seemed like it wasn't expected the code would be read 20
years later), but it would likely not be a v12 problem.

> * Character class elements can be #xNN (NN being hex digits), which seems
> equivalent to POSIX \xNN as long as you're using UTF8 encoding. Again,
> the compatibility costs of allowing that don't seem attractive, since #
> isn't a metacharacter today.

Seems reasonable.

> * Character class elements can be \p{UnicodeProperty} or
> the complement \P{UnicodeProperty}, where there are a bunch of different
> possible properties. Perhaps we could add that someday; since there's no
> reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
> compatibility hit. But I'm content to document this as unimplemented
> for now.

+1.

> * XQuery adds character class shorthands \i (complement \I) for "initial
> name characters" and \c (complement \C) for "NameChar". Same as above;
> maybe add someday, but no hurry.

+1.

> * It looks like XQuery's \w class might allow more characters than our
> interpretation does, and hence \W allows fewer. But since \w devolves
> to what libc thinks the "alnum" class is, it's at least possible that
> some locales might do the same thing XQuery calls for.

I'd still add this to the "to document" list.

> * The SQL-spec newline business mentioned above is a possible exception:
> it appears to require that when '.' is allowed to match newlines, a
> single '.' should match a '\r\n' Windows newline. I think we can
> document that and move on.

+1.

> * The x flag in XQuery is defined as ignoring all whitespace in
> the pattern except within character class expressions. Spencer's
> x flag does mostly that, but it thinks that "\ " means a literal space
> whereas XQuery explicitly says that the space is ignored and the
> backslash applies to the next non-space character. (That's just
> weird, in my book.) Also, Spencer's x mode causes # to begin
> a comment extending to EOL, which is a nice thing XQuery hasn't
> got, and it says you can't put spaces within multi-character
> symbols like "(?:", which presumably is allowed with XQuery's "x".
>
> I feel a bit uncomfortable with these inconsistencies in x-flag
> rules. We could probably teach the regexp library to have an
> alternate expanded mode that matches XQuery's rules, but that's
> not a project to tackle for v12.

That does not sound fun by any means. But likely that would be a part of
an overall effort to implement XQuery rules.

> I tentatively recommend that
> we remove the jsonpath "x" flag for the time being.

I would add an alternative suggestion of just removing that "x" is
supported in the documentation...but likely better to just remove the
flag + docs.

> Also, I noted some things that seem to be flat out sloppiness
> in the XQuery flag conversions:
>
> * The newline-matching flags (m and s flags) can be mapped to
> features of Spencer's library, but jsonpath_gram.y does so
> incorrectly
> * XQuery says that the q flag overrides m, s, and x flags, which is
> exactly the opposite of what our code does; besides which the code
> is flag-order-sensitive which is just wrong.
>
> These last two are simple to fix and we should just go do it.

+1.

> Otherwise, I think we're okay with regarding Spencer's library
> as being a sufficiently close approximation to LIKE_REGEX.
> We need some documentation work though.

My main question is "where" -- I'm thinking somewhere in the JSON
path[2] section. After reading your email 3 times, I may have enough
knowledge to attempt some documentation on the regexp in JSON path.

Jonathan

[1] https://www.regular-expressions.info/charclasssubtract.html
[2]
https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2019-09-16 23:11:13 Re: Define jsonpath functions as stable
Previous Message Thomas Munro 2019-09-16 22:03:20 Re: POC: Cleaning up orphaned files using undo logs