| From: | Peter Eisentraut <peter(at)eisentraut(dot)org> | 
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Support regular expressions with nondeterministic collations | 
| Date: | 2024-10-22 08:16:47 | 
| Message-ID: | 899e7b5f-b54a-4e1b-9218-bb23534fc2c4@eisentraut.org | 
| Lists: | pgsql-hackers | 
This patch allows using regular expression functions and operators with 
nondeterministic collations.
This complements the patches "Support LIKE with nondeterministic 
collations" and "Support POSITION with nondeterministic collations" but 
is independent.  These three together fix most of the places where 
nondeterministic collations are currently not allowed.
I had to decide here what the semantics should be.  The SQL standard 
doesn't say anything about this; it just refers to XQuery.  XQuery has no knowledge
of SQL collations.  I also studied the relevant Unicode standard (UTS 
#18) and it makes no mention of collations.  So my conclusion is that 
regular expressions should pay no attention to collations.  That makes 
it easy.
To clarify a bit more: Regular expressions don't pay attention to the
collate part of collations.  So if you have an accent-insensitive
collation, that doesn't make the regular expression match
accent-insensitively.  But they do, and will continue to, pay attention
to the ctype part of collations.  The latter is a PostgreSQL extension.
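
As a rough illustration of that split, here is a small Python sketch. It is only an analogy and does not use PostgreSQL: `strip_accents` and `collation_equal` are hypothetical stand-ins for an accent-insensitive collation, and Python's `re` module stands in for the regular expression engine.

```python
import re
import unicodedata


def strip_accents(s: str) -> str:
    """Hypothetical stand-in for an accent-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks (accents) left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collation_equal(a: str, b: str) -> bool:
    """Equality as an accent-insensitive collation would see it (assumed)."""
    return strip_accents(a) == strip_accents(b)


text = "r\u00e9sum\u00e9"  # "résumé" with precomposed accented letters

# "collate" part: the collation treats the two strings as equal.
collation_says_equal = collation_equal(text, "resume")

# The regular expression compares character sequences, so it does not match.
regex_matches = re.fullmatch("resume", text) is not None

# "ctype" part: character classification still applies, so the accented
# letters count as word characters.
ctype_word = re.fullmatch(r"\w+", text) is not None
```

Under these assumptions the collation comparison succeeds, the literal regular expression does not match, and the character-class match still works.
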
Note that UTS #18 has "retracted" support for tailoring in regular 
expressions, which supports the idea that regular expressions should be 
independent of things like language settings.
I think this is sensible.  Regular expressions are inherently based on 
sequences of characters, and trying to marry that with nondeterministic 
collations just doesn't fit.
But: We also convert SIMILAR TO patterns to standard regular 
expressions, and SIMILAR TO is covered in the SQL standard.  And the 
definition there does take the collation into account.  But the 
definition there is pretty much impossible to implement for 
nondeterministic collations:  It basically says, the predicate is true 
if the string to be matched is equal, using the applicable collation, to 
any of the strings in the set of strings described by the regular 
expression.  Which is a nice computer-sciency way to define it, but it 
doesn't work in practice.
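
To make that definition concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration: `collation_key` is a hypothetical case- and accent-insensitive collation, and the "pattern" is restricted to a list of alternatives per position so that the set of described strings is finite and can be enumerated.

```python
import unicodedata
from itertools import product


def collation_key(s: str) -> str:
    """Hypothetical nondeterministic collation: ignores case and accents."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def language_of_toy_pattern(alternatives_per_position):
    """Enumerate every string described by a toy pattern.

    [["caf", "th"], ["e"]] plays the role of the pattern (caf|th)e.
    """
    return {"".join(parts) for parts in product(*alternatives_per_position)}


def similar_to_by_definition(subject, alternatives_per_position):
    """True if subject is collation-equal to any string in the language."""
    language = language_of_toy_pattern(alternatives_per_position)
    return any(
        collation_key(subject) == collation_key(candidate)
        for candidate in language
    )


pattern = [["caf", "th"], ["e"]]
matches_cafe = similar_to_by_definition("CAF\u00c9", pattern)
matches_tea = similar_to_by_definition("tea", pattern)
```

This only works because the toy pattern describes finitely many strings. As soon as a pattern contains a repetition or wildcard, the set can no longer be enumerated this way, which is one way to see why the definition is hard to implement directly.
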
So I need a way to remember whether a regular expression was originally 
a SIMILAR TO pattern and then error out if the collation is 
nondeterministic.  I figured out a way to do that:  Regular expressions 
support prefixes like "***X", where X is some character.  I added a new 
prefix "***S".  This is not externally visible; it only gets used
internally, and it doesn't conflict with real regular expressions.
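
The marker idea can be sketched roughly as follows. This is Python, not the C code in the patch, and the function names, the exception class, and the error text are all hypothetical; only the "***S" prefix itself comes from the description above.

```python
SIMILAR_TO_MARKER = "***S"  # internal prefix described above


class NondeterministicCollationError(Exception):
    """Hypothetical error type for this sketch."""


def tag_similar_to(regex: str) -> str:
    """Mark a regex as having been translated from a SIMILAR TO pattern."""
    return SIMILAR_TO_MARKER + regex


def prepare_regex(pattern: str, collation_is_deterministic: bool) -> str:
    """Strip the internal marker, refusing SIMILAR TO patterns when the
    collation is nondeterministic. Plain regexes pass through unchanged."""
    if pattern.startswith(SIMILAR_TO_MARKER):
        if not collation_is_deterministic:
            # Hypothetical message, not the wording used by the patch.
            raise NondeterministicCollationError(
                "SIMILAR TO is not supported with nondeterministic collations"
            )
        return pattern[len(SIMILAR_TO_MARKER):]
    return pattern
```

With this shape, `prepare_regex("ab+", False)` passes through untouched, while `prepare_regex(tag_similar_to("^(?:ab)$"), False)` raises, so only patterns that originated from SIMILAR TO are rejected.
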
In summary, this patch doesn't change any functionality that currently 
works.  It only removes one error message and lets regular expressions
run, regardless of whether the collation is nondeterministic.
| Attachment | Content-Type | Size | 
|---|---|---|
| v1-0001-Support-regular-expressions-with-nondeterministic.patch | text/plain | 8.9 KB | 