SQL-spec incompatibilities in similar_escape() and related stuff

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Subject: SQL-spec incompatibilities in similar_escape() and related stuff
Date: 2019-05-13 00:43:34
Message-ID: 14047.1557708214@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Over in pgsql-bugs [1], we're looking into some bugs associated with
mistranslation of SQL-spec regexes to POSIX regexes. However, while
poking at that I couldn't help noticing that there are more ways in
which we're not following the letter of the SQL spec in this area:

* As Andrew noted, somewhere between SQL99 and SQL:2008, the committee
just up and changed the syntax of <regular expression substring function>.
SQL99 has

<regular expression substring function> ::=
SUBSTRING <left paren> <character value expression>
FROM <character value expression>
FOR <escape character> <right paren>

but in recent versions it's

<regular expression substring function> ::=
SUBSTRING <left paren> <character value expression>
SIMILAR <character value expression>
ESCAPE <escape character> <right paren>

I am, frankly, inclined to ignore this as a bad idea. We do have
SIMILAR and ESCAPE as keywords already, but they're type_func_name_keyword
and unreserved_keyword respectively. To support this syntax, I'm pretty
sure we'd have to make them both fully reserved. That seems likely to
break existing applications, and I don't think it's worth it. But it's
probably something to discuss.

* Our function similar_escape() is not documented, but it underlies
three things in the grammar:

Translated as "a ~ similar_escape(b, null)"

Translated as "a ~ similar_escape(b, e)"

substring(a, b, e)
This is a SQL function expanding to
select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))

To support the first usage, similar_escape is non-strict, and it takes
a NULL second argument to mean '\'. This is already a SQL spec violation,
because as far as I can tell from the spec, if you don't write an ESCAPE
clause then there is *no* escape character; there certainly is not a
default of '\'. However, we document this behavior, so I don't know if
we want to change it.

This behavior also causes spec compatibility problems in the second
syntax, because "a SIMILAR TO b ESCAPE NULL" is treated as though
it were "ESCAPE '\'", which is again a spec violation: the result
should be null.

And, just to add icing on the cake, it causes performance problems
in the third syntax. 3-argument substring is labeled proisstrict,
which is correct behavior per spec (the result is NULL if any of
the three arguments are null). But because similar_escape is not
strict, the planner fails to inline the SQL function, reasoning
(quite accurately) that doing so would change the behavior for
null inputs. This costs us something like 4x performance compared
to the underlying 2-argument POSIX-regex substring() function.

I'm not sure what we want to do here, but we probably ought to do
something, because right now substring() and SIMILAR TO aren't even
in agreement between themselves let alone with the SQL spec. We
could either move towards making all these constructs strict in
accordance with the spec (and possibly breaking some existing
applications), or we could make substring(a, b, e) not strict so
that it inherits similar_escape's idea of what to do for e = NULL.

* similar_escape considers a zero-length escape string to mean
"no escape character". This is contrary to spec which clearly
says that a zero-length escape string is an error condition
(just as more-than-one-character is an error condition). It's
also undocumented. Should we tighten that up to conform to spec,
or document it as an extension?

* Per spec, escape-double-quote must appear exactly twice in
the second argument of substring(a, b, e), while it's not valid
in SIMILAR TO. similar_escape() doesn't enforce this, and it
can't do so as long as we are using the same pattern conversion
function for both constructs. However, we could do better than
we're doing:

* If there are zero occurrences, then what you get from substring()
is the whole input string if it matches, as if escape-double-quote
had appeared at each end of the string. I think this is fine, but
we ought to document it.

* If there are an odd number of occurrences, similar_escape() doesn't
complain, but you'll get this from the regex engine:
ERROR: invalid regular expression: parentheses () not balanced
The fact of an error isn't a problem, but the error message is pretty
confusing considering that what the user wrote was not parentheses.
I think similar_escape() ought to throw its own error with an on-point

* If there are more than two pairs of escape-double-quote, you get
some behavior that's completely not per spec --- the patterns
between the additional pairs still contribute to whether there's an
overall match, but they don't affect the result substring. I'm
inclined to think we ought to throw an error for this, too.

* The spec is much tighter than we are concerning what's a legal
escape sequence. This is partly intentional on our part, I think;
notably, you can get at POSIX-regex escapes like "\d", which the
SQL spec doesn't provide. Although I think this is intentional,
it's not documented. I'm not sure if we want to tighten that
up or document what we allow ... thoughts?

I am not eager to change any of this in released branches, but
I think there's a good case for doing something about these
points in HEAD.

regards, tom lane

[1] https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net


Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-05-13 01:37:25 Re: PG 12 draft release notes
Previous Message Noah Misch 2019-05-13 00:37:05 Re: [HACKERS] WAL logging problem in 9.4.3?