Cluster wide option to control symbol case folding

From: "Lewis, Ian \(Microstar Laboratories\)" <ilewis(at)mstarlabs(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Cluster wide option to control symbol case folding
Date: 2016-12-25 02:29:02
Message-ID: ACF85C502E55A143AB9F4ECFE960660A17282D@mailserver2.local.mstarlabs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Is there any chance that the PostgreSQL developers would accept a new
cluster wide configuration option to control how the system handles
symbol case folding?

Currently PostgreSQL folds all un-quoted symbols to lower case.

We would like to add a global configuration option with a name like
symbol_casefold

with settings to allow folding symbols to
lc - lower case (default)
uc - upper case
none - no case folding

USE CASE 1
The option we actually want for our own use is "none". Currently we have
several large bodies of code that work with an SQL Anywhere backend.
This server preserves case and - by a configuration option - performs
case insensitive lookup on all symbols. We are moving to PostgreSQL for
internal applications. We are also considering using PostgreSQL for the
storage backend in a product.

We use camel case for symbol names in all of our application and backend
code (largely written in C++, Object Pascal, and Python). In a language
like Pascal that ignores case, we still maintain consistent use of case
in symbol names for the sake of human readers of the code.

Where we make a connection from application code to a storage backend we
would like to use exactly the same symbol name - including case - for
the symbol in the backend as we use in the corresponding symbol in
application code. For example, if we intend to read a value into a
variable called FirstName we would like the corresponding field in the
database to also be FirstName. The main reason we want this exact match
is so that a human reader sees exactly the same thing in the two places.
This makes it easier to see the connection between the two bodies of
code. However, in places we also automate the connection between client
symbols and server symbols, and in such a case it is useful, though
certainly not necessary, to have an exact match.

While we do not do so at present, where we use a database purely through
code, we can likely add double quotes around all symbols in our
generated SQL, which means we can have an exact match as we want under
PostgreSQL as it currently behaves.

However, in many cases we also access the same database backend through
third party tools and ad hoc queries.

If we double quote all of our symbols, which is what we have tried to do
in our initial tests, then all such tools and manually written queries
must also double quote all symbols.

We use several tools and libraries that do not appear to have any way to
properly quote symbols when they are obtained automatically from the
schema. So, for example, if our reporting tool retrieves a field called
FirstName from the database schema, it happily uses that name in its
internally generated SQL. And, on PostgreSQL, of course this fails
because FirstName in script is firstname at the server, which is not
defined if we have double quoted the field name in the table definition.
Since we do not have the code, we cannot work around this in any easy
way.

For ad hoc queries we can double quote all symbol references. But, this
makes the queries noticeably harder for a person to read. And, I do not
believe this is just a matter of "getting used to it". The quotes
clutter the script, and that clutter makes the script fundamentally
harder to understand. Maybe this is small, but anything that makes
comprehension harder is a bad thing. Queries can be hard enough to
understand without extra syntactic clutter.

So, these considerations leave us the option of never quoting symbols
when using PostgreSQL. This works everywhere we have tried it. But, it
is pretty unattractive from the point of view of looking at the symbol
names in the backend if they are to match exactly the symbol names we
use in code. Using a different naming convention than we use everywhere
else in our code (underscore separated all lower case symbol names, say)
is not appealing either. We are very consistent in our symbol name
handling in our code, and breaking our conventions in some relatively
large section of our code is very unattractive.

In addition, in many places our tools and code use schema supplied field
names to form column titles in a table or in a caption on an edit box
(etc.). The mixed case names are much nicer for this purpose than the
folded names.

For our purposes, these are our arguments for wanting control of how the
server folds case.

USE CASE 2
Even though we have no use for it, I have included the option "uc"
because, in trying to determine whether PostgreSQL could support our
desired behavior, I found a fairly large number of people who are coming
from a different backend, such as Oracle (from what I see on the
internet - no personal experience), that case fold similarly to
PostgreSQL. However, instead of folding to lower case, it appears a
number of other database servers fold to upper case.

This leaves people who are moving from these other database systems with
problems in their own code that they have to patch up to be able to make
the port successfully.

With an option to case fold to upper case, such users would have an
easier time moving to PostgreSQL.

IMPLEMENTATION
We have not studied the PostgreSQL code enough to know for sure that the
implementation of this proposed configuration option is simple. However,
an initial review indicates that symbol case folding is handled in a
single place in the function downcase_identifier().

If this analysis is correct, and there are no interactions with keyword
case insensitivity handling, then the implementation should be quite
simple. And, the implementation - since it would be isolated to a small
change in a single function - would also be relatively risk free for
messing up the default behavior of converting symbols to lower case.
Therefore, we would not be likely to corrupt the main use case of the
default behavior.

The reason we are proposing a cluster wide option, rather than a per
database or per schema case folding configuration option, is that the
current PostgreSQL symbol case conversion processing does not appear to
have any context. Therefore, it appears it would take a quite difficult
- and therefore correspondingly risky - restructuring to provide context
for the symbol case conversion.

Before looking at the code, our original idea was to propose a locale
based configuration for controlling symbol case conversions. We could
then define a locale with no case conversion for our use. However,
looking at downcase_identifier(), it appears that the code performs an
ASCII case conversion, and the notes indicate that there is good reason
for this. Therefore, we dropped this idea.

TIME FRAME
It is not completely clear that we gain enough to justify implementing
this proposal. At the moment, for example, we have no idea how to build
the PostgreSQL code for testing. And, I do not have any understanding of
what testing requirements you would have on any patch we might submit.
So, I am not completely sure we can dedicate enough time to the work to
be able to do the job.

However, we are interested in looking at the task seriously enough to
determine whether we are up to the implementation. If we were to pursue
an implementation, we would likely have something within the next six
months or so as we would do it as part of currently active development
work to port our systems to PostgreSQL.

Before pursuing this, I would like to have some idea whether the
PostgreSQL developers would be likely to accept our patch if we
implemented it well.

We are not interested in basing our systems on our own branch of
PostgreSQL. The gain to us is insufficient to justify the maintenance
requirements of keeping up with PostgreSQL development. And, we do not
want to end up stuck on our own old version of the tool. Therefore, if
this is not a proposal that the developers are likely to entertain, we
will work with the PostgreSQL system as it stands.

Ian Lewis
www.mstarlabs.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-12-25 05:52:18 Re: Cluster wide option to control symbol case folding
Previous Message Joel Jacobson 2016-12-25 00:21:43 Re: pg_stat_activity.waiting_start