Re: [HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?
Date: 2019-03-15 21:04:00
Message-ID: 6688474e-7c28-b352-bcec-ea0ef59d7a1a@anastigmatix.net
Lists: pgsql-hackers

On 1/25/16 12:52 PM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack <chap(at)anastigmatix(dot)net> wrote:
>>> What I would have expected would be to allow <Unicode escape value>s
>>> for any Unicode codepoint that's representable in the server encoding,
>>> whatever encoding that is.
>
>> I don't know anything for sure here, but I wonder if it would make
>> validating string literals in non-UTF8 encodings significantly more
>> costly.
>
> I think it would, and it would likely also require function calls to
> loadable functions (at least given the current design whereby encoding
> conversions are farmed out to loadable libraries). I do not especially
> want the lexer doing that; it will open all sorts of fun questions
> involving what we can lex in an already-failed transaction.

How outlandish would it be (not for v12, obviously!) to decree that
the lexer produces UTF-8 representations of string and identifier
literals unconditionally, and in some later stage of processing
the parse tree, those get munged to the server encoding if different?

That would keep the lexer simple, and I think it's in principle
the 'correct' view if there is such a thing; choice of encoding doesn't
change what counts as valid lexical form for a U&'...' or U&"..."
literal, but only whether a literal thus created might happen to fit
in your encoding.

If it doesn't, I think that's technically a data error (22021)
rather than one of syntax or lexical form.
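
To make the two-stage idea concrete, here is a rough sketch in Python. It is
not PostgreSQL code: the function and exception names are hypothetical, the
escape character is assumed to be the default backslash, and surrogate-pair
handling is left out (surrogates are simply rejected). Stage one decides only
whether the literal is lexically valid and always yields UTF-8; stage two
runs later and is the only place the server encoding matters.

```python
import re

# Hypothetical sketch, not PostgreSQL source. Matches \\ , \+XXXXXX or \XXXX.
_ESCAPE = re.compile(r"\\(?:\\|\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))")


class LexicalError(Exception):
    """Malformed escape: wrong in every encoding."""


class CharacterNotInRepertoire(Exception):
    """Data error: valid literal that does not fit the server encoding."""
    sqlstate = "22021"


def lex_unicode_body(body: str) -> bytes:
    """Stage 1: decode the body of a U&'...' literal to UTF-8 bytes.

    No encoding conversion is consulted here, so the result depends only
    on the text of the literal.
    """
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        m = _ESCAPE.match(body, i)
        if m is None:
            raise LexicalError(f"invalid Unicode escape at offset {i}")
        digits = m.group(1) or m.group(2)
        if digits is None:
            out.append("\\")  # doubled escape character
        else:
            cp = int(digits, 16)
            # Simplification: reject surrogates instead of pairing them.
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise LexicalError(f"invalid code point U+{cp:04X}")
            out.append(chr(cp))
        i = m.end()
    return "".join(out).encode("utf-8")


def to_server_encoding(utf8: bytes, server_encoding: str) -> bytes:
    """Stage 2: convert the lexer's UTF-8 output to the server encoding."""
    text = utf8.decode("utf-8")
    try:
        return text.encode(server_encoding)
    except UnicodeEncodeError as exc:
        raise CharacterNotInRepertoire(
            f"U+{ord(text[exc.start]):04X} has no equivalent in "
            f"{server_encoding}"
        ) from exc
```

With a LATIN1 server (Python codec name "latin-1"), the body d\00E9j\00E0
passes both stages, while \+01F600 passes stage one and fails only in
stage two, with the data error rather than a syntax error.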

> It may well be that these issues are surmountable with some sweat,
> but it doesn't sound like an easy patch to me. And how big is the
> use-case, really?

Hmm, other than the benefit of not having to explain why it /doesn't/
work?

One could imagine a tool generating SQL output that'll be saved and
run in a database through client or server encodings not known in
advance, adopting a simple strategy of producing only 7-bit ASCII
output and using U& literals for whatever ain't ASCII ... that would
be, in principle, about the most bulletproof way for such a tool
to work, but it's exactly what won't work in PostgreSQL unless the
encoding is UTF-8 (which is the one scenario where there's no need
for such machinations, as the literals could appear directly!).
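
As an illustration of that strategy, a minimal sketch of the generating side
might look like the following. The function name is made up for this example
and is not taken from any real tool; it assumes the default backslash escape
character and does not treat lone surrogates specially.

```python
def ascii_unicode_literal(value: str) -> str:
    """Render value as a U&'...' string literal using only 7-bit ASCII.

    Hypothetical helper for illustration. Printable ASCII passes through,
    quotes and backslashes are doubled, and everything else becomes a
    \\XXXX or \\+XXXXXX Unicode escape.
    """
    parts = []
    for ch in value:
        cp = ord(ch)
        if ch == "'":
            parts.append("''")        # SQL quote doubling
        elif ch == "\\":
            parts.append("\\\\")      # escape character doubling
        elif 0x20 <= cp < 0x7F:
            parts.append(ch)          # printable ASCII as-is
        elif cp <= 0xFFFF:
            parts.append(f"\\{cp:04X}")
        else:
            parts.append(f"\\+{cp:06X}")
    return "U&'" + "".join(parts) + "'"
```

The output is the same byte sequence under any ASCII-compatible client
encoding, which is what makes the approach attractive for a script whose
eventual encoding is unknown.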

I'm a maintainer of one such SQL-generating tool, so I know the set
of use cases would have at least one element, if only it would work.

Regards,
-Chap
