Re: Open issues for collations

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Open issues for collations
Date: 2011-03-26 17:16:34
Message-ID: 20110326171634.GA16407@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any. Is either part of that
> sane? Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

That seems all a bit weird. I spent some time reading through the SQL
spec to see if I could came up with a few ideas about what they thought
relevent. I think the gist of it is that I think the result row should
have for each column its declared collation in all cases.

Firstly, the SQL doesn't go into the general case where the collate
result of a function is based in it inputs. But in any case, if the
function returns a record, the whole record would have that collation,
which is absurd. I think letting it go to the contained columns is just
weird.

Secondly, I think the derivation algorithm is for determing the
collation of expressions which have no otherwise declared collation.
Anything returning a predefined record type has a predefined collation
and it should be used. If you're in a query referring to rowvar.field
and rowvar has a type, that's what should be used.

(No doubt you have some corner cases in mind?)

As for operators, they should behave like functions wherever possible,
otherwise it's just introducing unnecessary differences.

The cast-case is related to below.

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause. It's not clear that that's per spec. If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type? Should it matter if the cast-to-domain is
> explicit or implicit? Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type. Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.

The SQL spec considers the collation to be part of the datatype, so if
you're casting to a domain (or type) you get the collation associated
with that domain (or type). As per the spec:

"The collation derivation of a declared type with a declared type
collation that is explicitly or implicitly specified by a <data type>
is implicit."

So the result of a cast would be the collation of the specified
type/domain with state implicit.

Also, apparently the COLLATE clause as allowed anywhere where a
datatype is permitted. So you can say:

CAST( foo AS TEXT COLLATE "en_US" )

Not sure if that works now. The result would be implicit state, as
opposed to if the COLLATE clause appears elsewhere.

Incidently, a function returning a domain seems weird to me. What does
it mean: (1) the function returns this type, Postgres assumes this is
true, or (2) function returns something, Postgres does an implicit
cast?

In any case, I'd suggest it is treated as being included in the
resolving of the return collation with the arguments so if the result
is a domain and you apply the normal rules you get:

(1) explicit states in the arguments will override it
(2) if arguments are implicit state and conflict with domain, the
result is no-collation, otherwise implicitly whatever the domain was
(3) no arguments have collation, which means you get the domain
default.

Which all seems eminently reasonable.

So I'd agree with your rules, but add a case (c) result of a cast.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation? Should we provide a COLLATE option in
> variable declarations to let that be overridden? If Oracle understands
> COLLATE, probably we should look at what they do in PL/SQL.

If COLLATE is allowed anywhere where the datatype is allowed, then the
COLLATE clause should be permitted there. Otherwise they become the
specified type with whatever the default is for that type. In
expressions the coercible-default state will get overridden
by the IMPLICIT state from the arguments as appropriate.

I note I'm using the term coercible default here, because that's what
Transact-SQL calls the state for any variable or value that's not a column
reference. I'm just checking and don't see any support for it in the
SQL standard. While it seemed to me to be extremely useful, since it
allows column references to override literals.

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation. Right now you may get either table's
> collation depending on which query type is involved. I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance. (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

Using the referenced collation makes it clear that you don't want to
throw an error if the collations don't match. So yes, the generated
queries need to have the correct collation clauses.

Collations which redefine equality are on the horizon, case and accent
insensetivity being the obvious cases. Better get it right now.

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8. I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

That'd be nice. Perhaps one the existing cases can be converted to some
latin-X encoding. It does rely on the collation existing on the test
machine.

> * Is it worth adding a cares-about-collation flag to pg_proc? Probably
> too late to be worrying about such refinements for 9.1.

Long term it will be needed. It's been skipped now, but it will let you
produce better errors messages in the future.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Itagaki Takahiro 2011-03-26 17:44:09 Re: Lock problem with autovacuum truncating heap
Previous Message Robert Haas 2011-03-26 16:48:44 Re: Open issues for collations