Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE

From: Bryn Llewellyn <bryn(at)yugabyte(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Gavan Schneider <list(dot)pg(dot)gavan(at)pendari(dot)org>, Tom Lane PostgreSQL <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general list <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
Date: 2022-10-08 00:16:45
Message-ID: 72349704-92B2-46FF-8DE1-C3E43C9B5B6A@yugabyte.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> david(dot)g(dot)johnston(at)gmail(dot)com wrote:
>
>> bryn(at)yugabyte(dot)com wrote:
>>
>> (3) The PG doc on quote_ident says this in large friendly letters:
>>
>>> Quotes are added only if necessary…
>>
>> Notice "only". I now know that this is very much not the case. You can compose an effectively unlimited number of different examples along these lines:
>>
>> select quote_ident('redaktør'); → "redaktør"
>> create table redaktør(n int); → table successfully created
>
> Yep, and that is precisely what would make for a good bug report. Pointing out that "if necessary" does not indeed match up with the behavior. I suspect it is likely to get changed - everything else being discussed just detracts attention from it.

*BRIEFLY*

What does "make for a good bug report" mean, David? Is it:

(1.1) You, David, or somebody else who has been officially recognized as a PG Contributor (https://www.postgresql.org/community/contributors/) will file the bug, granting it credibility with their imprimatur?

or (1.2) I, Bryn, should file the bug.

About "I suspect it is likely to get changed", do you mean:

(2.1) Change the doc to match quote_ident's current, unpredictable, behavior? (By all means, substitute "hard to describe accurately, precisely, and yet still tersely" for "unpredictable".)

(2.2) Change quote_ident's implementation—and then write new doc to describe the new behavior precisely and accurately? And for this option, the next question is "What's the spec of the changed implementation?"

Notice that the issue is broader than just quote_ident, as this test shows:

prepare x(text) as select format('« %s », gives « %I ».', $1::text, $1::text);
execute x('dog');
execute x('Dog');
execute x('农民');

The same over-zealous double-quoting that quote_ident shows for 农民 is shown by format. Presumably they share the same underlying implementation (but, surprisingly, don't re-use the actual SQL parser code). Option 2.1 implies using the same wording for what provokes double-quoting for each function. I'd make a similar argument for option 2.2.

*MORE DETAIL*

About option (2.2), I mentioned that ORCL's equivalent to quote_ident implements a simpler rule: the text of the identifier that it returns is always surrounded with double quotes, whether or not doing this is necessary. The ORCL scheme relies on the fact that double-quoting when this isn't necessary is in no way harmful. Here’s pseudocode (presented as tested PL/pgSQL) for what a patched C implementation of quote_ident might do:

create function quote_ident_2(name in text)
returns text
language plpgsql
as $body$
declare
i0 text not null := quote_ident(name);
i1 text not null := regexp_replace(regexp_replace(i0, '^"', ''), '"$', '');
ident text not null := case(i1 = i0)
when true then '"'||i0||'"'
else i0
end case;
begin
return ident;
end;
$body$;

Re David’s

> everything else being discussed just detracts attention from it.

I’m not convinced. The discussion has shown that some people are somewhat confused. For example, it was suggested that a name like this:

农民

ought to be double-quoted. A simple test shows that this isn’t the case. And it helps if everybody is clear about that.

There's also the question of use-cases. I've been forced to think a lot about SQL injection over the years. It's immediately obvious from reading any of the skimpiest blogs on the topic that the root cause is always faulty code that's been written by a confused developer. But it's very rare to see an account of the root cause whose wording uses carefully defined terms of art. (In fact, the notion of defining and using such terms of art has been resisted by contributors to this list.) If a future PG shipped with a built-in function like this:

function is_exotic(name in text) returns boolean

...with, of course, excellent documentation, then an outfit that decided to outlaw the use of exotic names (and this is almost always how the outcome emerges) could police adherence to the rule, database wide, (and cluster wide for global phenomena) with a few trivial tests against the relevant catalog relations, like this:

do $body$
begin
assert not exists (select 1 from pg_class where is_exotic(relname));
end;
$body$;

A shipped "is_exotic" function would bring the secondary, but by no means insignificant, benefit, that the case for using “format” with "%I" or its "quote_ident" cousin could be grounded upon solidly defined, and named, notions. Like this example shows:

do $body$
declare
stmt constant text not null := 'create table %s(n int);';
n text not null := '';

simple_names constant text[] not null := array['farmers', 'bønder', '农民', '农民、儿童', '农民,儿童'];

exotic_names constant text[] not null := array['T', 'farmers and children', '"x', 'y"', '农民 & 儿童'];
begin
foreach n in array simple_names loop
assert not is_exotic(n), 'Problem with '||n;
end loop;

foreach n in array exotic_names loop
assert is_exotic(n), 'Problem with '||n;
end loop;
end;
$body$;

Who knew that using Chinese punctuation characters in a name (they have their own code points, and look a bit different when you look closely) don't make the name exotic...

*THE OTHER CHOICE (for option 2.2): make it true that quote_ident and format with %I surround the input with double quotes exactly and only when this is necessary*

Choosing between the two alternatives (the ORCL rule or dependable parsimony) wouldn't matter much if PG comes with a built-in is_exotic function. But if the user has to implement their own, then the implementation would be far, far simpler and therefore more reliable if it were decided to implement dependable parsimony.

The following code shows what I mean.

create function is_exotic(name in text)
returns boolean
language plpgsql
as $body$
declare
ident constant text not null := quote_ident_2(name);
stripped_ident constant text not null := regexp_replace(regexp_replace(ident, '^"', ''), '"$', '');
stmt constant text not null := 'deallocate %s';
begin
/*
This is the easy special case. The input is entirely letters inascii(7), or some
other character set that distinguishes between upper and lower case, and has at
least one upper case letter. This literal actual argument 'Dogs' is a sufficient
example. Here, the value of "stripped_ident" will be equal to the value of the
input. It's essential to test for this special case because the general test,
below, is (famously) happy to execute "deallocate Dog" and the like.
*/
if lower(name) <> stripped_ident then
return true;
end if;

/*
This is the harder general case. The input is anything at all that passes the
first test and that causes no error when used "as is" in the role of a SQL identifier.
Without access to the actual implementation of the SQL parser, the simplest practical
way to check this property is to use an empirical test. The "deallocate" statement is
fairly lightweight. The approach seems to bring no measureable performance penalty.
*/
begin
execute format(stmt, stripped_ident);
exception when invalid_sql_statement_name then null; end;
return false;

exception when syntax_error then
return true;
end;
$body$;

Compare this with the implementation that I thought, at first, that I could use when I simply believed the doc. (The subject line of this thread hits at the trivial SQL statement that would implement the "language SQL" function.) ANd if that's all there is to it, then when not ship is as a built-in?

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2022-10-08 00:27:17 Re: ('dog$house' = quote_ident('dog$house')) is surprisingly FALSE
Previous Message Adrian Klaver 2022-10-07 21:36:56 Re: pg_restore creates public schema?