Re: Open issues for collations

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Open issues for collations
Date: 2011-03-27 19:14:37
Message-ID: 14998.1301253277@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote:
>> ** Selecting a field from a record-returning function's output.
>> Currently, we'll use the field's declared collation; except that
>> if the field has default collation, we'll replace that with the common
>> collation of the function's inputs, if any. Is either part of that
>> sane? Do we need to make this work for functions invoked with other
>> syntax than a plain function call, eg operator or cast syntax?

> That seems all a bit weird. I spent some time reading through the SQL
> spec to see if I could came up with a few ideas about what they thought
> relevent. I think the gist of it is that I think the result row should
> have for each column its declared collation in all cases.

That interpretation would be fine with me. It would let us get rid of
the special-case code at lines 307-324 of parse_collate.c, which I put
in only because there are cases in the collate.linux.utf8.sql regression
test that fail without it. But I'm perfectly happy to conclude that
those test cases are mistaken.

>> ** What to do with domains whose declaration includes a COLLATE clause?

> The SQL spec considers the collation to be part of the datatype, so if
> you're casting to a domain (or type) you get the collation associated
> with that domain (or type). As per the spec:

> "The collation derivation of a declared type with a declared type
> collation that is explicitly or implicitly specified by a <data type>
> is implicit."

> So the result of a cast would be the collation of the specified
> type/domain with state implicit.

Hm. That makes sense for explicit CAST syntax, but what about a
function returning a collatable type? In particular, applying this
rule to the || operator would have us conclude that

x || y COLLATE "foo"

doesn't have an overall collation of "foo", which seems clearly wrong.
But if you claim that it's all based on the result type of the function
then it's hard to avoid that outcome.

> Also, apparently the COLLATE clause as allowed anywhere where a
> datatype is permitted. So you can say:

> CAST( foo AS TEXT COLLATE "en_US" )

No, you can't; this case is specifically disallowed by SQL:2008 6.12
<cast specification> syntax rule 1b: <data type> shall not contain a
<collate clause>. Which is a pretty weird thing for them to say if
you believe that collations are nothing more nor less than a data type
property. It kinda leads me to think that that is not how the committee
sees collations ... although the spec is pretty unhelpful at suggesting
exactly how you should think of them instead.

> Incidently, a function returning a domain seems weird to me. What does
> it mean: (1) the function returns this type, Postgres assumes this is
> true, or (2) function returns something, Postgres does an implicit
> cast?

> In any case, I'd suggest it is treated as being included in the
> resolving of the return collation with the arguments so if the result
> is a domain and you apply the normal rules you get:

> (1) explicit states in the arguments will override it
> (2) if arguments are implicit state and conflict with domain, the
> result is no-collation, otherwise implicitly whatever the domain was
> (3) no arguments have collation, which means you get the domain
> default.

> Which all seems eminently reasonable.

It's reasonable by itself, but it's also rather different from what you
just argued the behavior should be for field selection. I don't see
the rationale for treating those cases differently. In fact, given that
Postgres has always treated f(x) and x.f as equivalent notations,
I think there's a pretty strong argument for wanting their effects on
collation choice to be equivalent too.

>> * In plpgsql, is it OK for declared local variables to inherit the
>> function's input collation? Should we provide a COLLATE option in
>> variable declarations to let that be overridden? If Oracle understands
>> COLLATE, probably we should look at what they do in PL/SQL.

> If COLLATE is allowed anywhere where the datatype is allowed, then the
> COLLATE clause should be permitted there. Otherwise they become the
> specified type with whatever the default is for that type. In
> expressions the coercible-default state will get overridden
> by the IMPLICIT state from the arguments as appropriate.

No, I don't care for that, because then it matters whether an expression
contains any direct references to the input arguments versus containing
just local variables. In particular, assigning a parameter to a local
variable and then using the local variable instead of the parameter
could change the function's results. That seems mighty surprising and
bug-prone to me. So I think it's important that local variables default
to the function's input collation. I have no problem with adding
explicit COLLATE to the declaration syntax for plpgsql variables,
though.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2011-03-27 19:20:12 Re: Performance Farm Release
Previous Message Tomáš Pospíšil 2011-03-27 19:01:37 Re: [HACKERS] Needs Suggestion