Re: Ltree syntax improvement

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dmitry Belyavsky <beldmit(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Nikolay Shaplov <dhyan(at)nataraj(dot)su>
Subject: Re: Ltree syntax improvement
Date: 2020-04-05 01:26:53
Message-ID: 9462.1586050013@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru> writes:
> Fixed patch attached.

I looked through this again, and I still don't feel very happy with it.

As I mentioned upthread, I'm not convinced that we ought to have
*both* quoting and backslash-escaping in these datatypes. TIMTOWTDI
may be a virtue to the Perl crowd, but to a lot of other people it's
just confusing and extra complication. In particular it will translate
directly to extra complication in every application that wants to use
non-alphanumeric labels, since they'll have to be prepared to deparse
whatever style the backend happens to put out. It's also adding a far
from trivial amount of complication to the patch itself. And it adds
bug-prone ambiguity; for instance, the patch fails to document whether
backslashes do anything special within quotes.

On balance, since the existing limitations on labels are an approximation
of SQL identifier rules, I think it'd be a good idea to use SQL-identifier
quoting rules; that is, you have to use quotes for anything that isn't
alphanumeric, backslashes are not special, and the way to get a double
quote that's data is to write two adjacent double quotes. That's simple
and it doesn't create any confusion when writing an ltree value inside
a SQL literal. It's less great if you're trying to write an ltree within
an array or record literal value, but we can't have everything (and the
backslash alternative would be no better for that anyway).

I've still got mixed emotions about the exception of allowing whitespace
to be stripped before or after a label. While that probably doesn't pose
any forward-compatibility hazards (ie, it's unlikely we'd want to redefine
such syntax to mean something different), I'm not sure it passes the
least-confusion test. We were just getting beat up a couple days ago
about the fact that record_in is lax about whitespace [1][2], so maybe
we shouldn't make that same choice here.

As far as the code itself goes, I don't think the code structure in
ltree_io.c is very well chosen. There seem to be half a dozen routines
that are all intimately dependent on the quoting/escaping rules, and
it's really pretty hard to be sure that they're all in sync (and if
they're not, that's going to lead to crashes and perhaps security-grade
bugs). Moreover it looks like you end up making multiple passes over the
input, so it's inefficient as well as hard to understand/maintain. Given
the choice to have a separate frontend function that recognizes a label as
a single token, it seems like it'd be better if that function were also
responsible for generating a de-quoted label value as it went.

I also feel that not enough attention has been paid to the error
reporting. For instance, the patch changes the longstanding policy of
reporting an overlength label with its end position:

@@ -53,7 +54,7 @@ SELECT repeat('x', 255)::ltree;

SELECT repeat('x', 256)::ltree;
ERROR: label string is too long
-DETAIL: Label length is 256, must be at most 255, at character 257.
+DETAIL: Label length is 256, must be at most 255, at character 1.
SELECT ltree2text('1.2.3.34.sdf');
ltree2text
--------------

That might be an OK choice in a green field, but this isn't a green field,
even if we did change the wording of the message since v12. I also noted
some random inconsistencies such as

regression=# select '1234567890*1234567890'::ltree;
ERROR: ltree syntax error at character 11
LINE 1: select '1234567890*1234567890'::ltree;
^
regression=# select '1234567890+1234567890'::ltree;
ERROR: ltree syntax error at character 11
LINE 1: select '1234567890+1234567890'::ltree;
^
DETAIL: Unexpected character

Apparently that's because * is special to lquery while + isn't, but
that distinction really shouldn't matter to ltree. (This DETAIL
message isn't close to following project style for detail messages,
either.) It'd probably be better if the frontend function didn't
contain assumptions about which punctuation characters are important.

Another thought here, though it's not really the fault of this patch,
is that I really think ltree ought to go over to a treatment of
non-ASCII characters that's more like the core parser's, ie anything
that isn't ASCII is data (or assumed-to-be-alphanumeric, if you prefer).
The trouble with the current definition is that it's dependent on
LC_CTYPE, so labels that are acceptable to one database might not be
acceptable elsewhere. We could remove that hazard, and as a bonus
eliminate some possibly-expensive character classification tests,
if we just said all non-ASCII characters are data.

regards, tom lane

[1] https://www.postgresql.org/message-id/BCE3F9FC-6BE9-44D8-AD25-F5FF109C7BD4@yugabyte.com
[2] https://www.postgresql.org/message-id/158593753274.7901.11178770123847486396%40wrigleys.postgresql.org

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-04-05 04:02:56 Re: Reinitialize stack base after fork (for the benefit of rr)?
Previous Message Justin Pryzby 2020-04-05 00:22:43 Re: VACUUM memory management