Quick Links

Re: Open issues for collations

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Open issues for collations
Date:	2011-04-05 14:13:54
Message-ID:	BANLkTi=ayf5RT6b6JE7LdaobB7VEYvBvHg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Reading through this thread...

On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any. Is either part of that
> sane? Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

There were a couple of different ideas about which way we ought to go
with this, but I'm happy to defer to what Tom and Martijn hashed out:

MO> That seems all a bit weird. I spent some time reading through the SQL
MO> spec to see if I could came up with a few ideas about what they thought
MO> relevent. I think the gist of it is that I think the result row should
MO> have for each column its declared collation in all cases.

TL> That interpretation would be fine with me. It would let us get rid of
TL> the special-case code at lines 307-324 of parse_collate.c, which I put
TL> in only because there are cases in the collate.linux.utf8.sql regression
TL> test that fail without it. But I'm perfectly happy to conclude that
TL> those test cases are mistaken.

I'm not sure whether that's been done, though, or whether we're even
going to do it.

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause. It's not clear that that's per spec. If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type? Should it matter if the cast-to-domain is
> explicit or implicit? Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type. Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.

I think we had agreement than a cast to a domain type with a collation
should stomp on any existing collation on the contained expression.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation? Should we provide a COLLATE option in
> variable declarations to let that be overridden?

I think everyone who responded said "yes" to both questions.

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation. Right now you may get either table's
> collation depending on which query type is involved. I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance. (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

I tested this and, indeed, if the collations don't match, the index
can't be used.

CREATE TABLE me (x character varying COLLATE "en_US");
CREATE TABLE me2 (x character varying COLLATE "es_ES");
CREATE TABLE you (x character varying COLLATE "es_ES" NOT NULL,
PRIMARY KEY (x));
ALTER TABLE me ADD FOREIGN KEY (x) REFERENCES you(x);
ALTER TABLE me2 ADD FOREIGN KEY (x) REFERENCES you(x);
SET enable_seqscan=false;
SET enable_hashjoin=false;

With that setup, this still does a seqscan-and-sort:

EXPLAIN select * from me, you where me.x = you.x;

But this uses the index:

EXPLAIN select * from me2, you where me2.x = you.x;

I found another problem, too:

rhaas=# insert into you values ('1');
INSERT 0 1
rhaas=# insert into me values ('1');
INSERT 0 1
rhaas=# alter table me alter column x set data type varchar collate "en_GB";
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.
CONTEXT: SQL statement "SELECT fk."x" FROM ONLY "public"."me" fk LEFT
OUTER JOIN ONLY "public"."you" pk ON ( pk."x"::pg_catalog.text
OPERATOR(pg_catalog.=) fk."x"::pg_catalog.text) WHERE pk."x" IS NULL
AND (fk."x" IS NOT NULL)"

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8. I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

Seems like no one knows how to do this.

> * Remove initdb's warning about useless locales? Seems like pointless
> noise, or at least something that can be relegated to debug mode.

Everyone was in favor of this.

> * Is it worth adding a cares-about-collation flag to pg_proc? Probably
> too late to be worrying about such refinements for 9.1.

Probably too late for this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Open issues for collations at 2011-03-26 04:36:43 from Tom Lane

Responses

Re: Open issues for collations at 2011-04-08 17:57:21 from Tom Lane
Re: Open issues for collations at 2011-04-18 14:41:07 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2011-04-05 14:31:54	Re: time table for beta1
Previous Message	Greg Stark	2011-04-05 14:01:42	Re: Re: synchronous_commit and synchronous_replication Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.