Re: PGDay.it collation discussion notes

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-20 09:28:45
Message-ID: 48FC4F4D.2040403@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Another objection to this design is that it's completely unclear that
> functions from text to text should necessarily yield the same collation
> that went into them, but if you treat collation as a hard-wired part of
> the expression syntax tree you aren't going to be able to do anything else.
> (What will you do about functions/operators taking more than one text
> argument?)

Whatever the spec says. Collation is intimately associated with the
comparison operations, and doesn't make any sense anywhere else. The way
the default collation for a given operation is determined, by bubbling
up the collation from the operands, through function calls and other
expressions, is just to make life a bit easier for the developer who's
writing the SQL. We could demand that you always explicitly specify a
collation when you use the text equality or inequality operators, but
because that would be quite tiresome, a reasonable default is derived
from the context.

I believe the spec stipulates how that default is derived, so I don't
think we need to fret over it. We'll need it eventually, but the parser
changes are not the critical part. We can start off by deriving the
collation from a GUC variable, for example.
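
To make the "bubbling up" concrete, here is a rough sketch in Python.
The rules are a simplification chosen for illustration, not a
transcription of the spec: an explicit COLLATE beats an implicit
(column) collation, two different explicit collations are an error, and
two different implicit collations leave nothing derivable. The names
Derivation, combine and comparison_collation are hypothetical, and
falling back to a GUC-style default when nothing can be derived is an
assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


class CollationConflict(Exception):
    """Two explicit COLLATE clauses named different collations."""


@dataclass(frozen=True)
class Derivation:
    collation: Optional[str]  # None: nothing could be derived
    explicit: bool            # True if set by an explicit COLLATE


def combine(a: Derivation, b: Derivation) -> Derivation:
    """Bubble the collations of two operands up to their parent node."""
    if a.explicit and b.explicit:
        if a.collation != b.collation:
            raise CollationConflict(f"{a.collation} vs {b.collation}")
        return a
    if a.explicit or b.explicit:
        # An explicit collation overrides an implicit one.
        return a if a.explicit else b
    # Both implicit: keep the collation only if the operands agree.
    if a.collation == b.collation:
        return a
    return Derivation(None, False)


def comparison_collation(a: Derivation, b: Derivation,
                         guc_default: str) -> str:
    """Collation a comparison would use (assumed fallback: the GUC)."""
    derived = combine(a, b).collation
    return derived if derived is not None else guc_default
```

With this, a column collated as "fi_FI" compared against an expression
carrying an explicit "sv_SE" resolves to "sv_SE", while two columns with
different implicit collations fall through to the default.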

> I think it would be better to treat the collation indicator as part of
> string *values* and let it bubble up through expressions that way.
> The "expr COLLATE ident" syntax would be a simple run-time operation
> that pokes a new collation into a string value. The notion of a column
> having a particular collation would then amount to a check constraint on
> the values going into the column.

Looking at an individual value, collation just doesn't make sense.
Collation is a property of the comparison operation, not of a value.

In the parser, we might have to do something like that though, because
according to the standard you can tack a COLLATE clause onto string
constants and have it bubble up. But let's keep that ugliness just
inside the parser.

One impractical way to implement collation would be to have one
operator class per collation. In fact you could do that today, with no
backend changes, to support multiple collations. It's totally
impractical, because for starters you'd need different comparison
operators, with different names, for each collation. But it's the right
mental model.
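
As a toy illustration of that mental model (not of anything in the
backend), the sketch below gives each collation its own comparison
function and its own operator name. The two "collations" are stand-ins,
code-point order and a case-insensitive order, and every name in it
(text_lt_c, "<#ci", OPERATOR_CLASSES, ...) is made up for the example.

```python
import functools
from typing import Callable, Dict, List


def text_lt_c(a: str, b: str) -> bool:
    """'<' under code-point order (stand-in for the C collation)."""
    return a < b


def text_lt_ci(a: str, b: str) -> bool:
    """'<' under a case-insensitive order (stand-in for a locale)."""
    return a.casefold() < b.casefold()


# One operator class per collation, each needing its own operator name.
OPERATOR_CLASSES: Dict[str, Dict[str, Callable[[str, str], bool]]] = {
    "text_ops_c": {"<#c": text_lt_c},
    "text_ops_ci": {"<#ci": text_lt_ci},
}


def sort_with(opclass: str, op: str, values: List[str]) -> List[str]:
    """Sort using the '<' operator of the chosen operator class."""
    lt = OPERATOR_CLASSES[opclass][op]

    def cmp(a: str, b: str) -> int:
        return -1 if lt(a, b) else (1 if lt(b, a) else 0)

    return sorted(values, key=functools.cmp_to_key(cmp))
```

Every additional collation adds another entry and another operator name,
which is exactly why this does not scale.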

I think the right approach is to invent a new concept called "operator
modifier". It's basically a 3rd argument to operators. It can be
specified explicitly when an operator is used, with syntax like "<left>
Op <right> USING <modifier>", or, in the case of collation, derived from
the context, per the SQL spec. The operator modifier is tacked on to OpExprs
and SortClauses in the parser, and passed as a 3rd argument to the
function implementing the operator at execution time.
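
A minimal sketch of the idea, in Python rather than backend C: a single
"<" implementation for text that receives the modifier as its third
argument, and an OpExpr-like node that carries the modifier chosen at
parse time. COLLATION_KEYS, text_lt and this OpExpr class are
hypothetical, and the two collations are toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical registry: modifier name -> function producing a sort key.
COLLATION_KEYS: Dict[str, Callable[[str], str]] = {
    "C": lambda s: s,      # code-point order
    "ci": str.casefold,    # toy case-insensitive collation
}


def text_lt(left: str, right: str, modifier: str) -> bool:
    """One '<' for text; the operator modifier is the 3rd argument."""
    key = COLLATION_KEYS[modifier]
    return key(left) < key(right)


@dataclass
class OpExpr:
    """Expression node with the modifier attached by the parser."""
    op: Callable[[str, str, str], bool]
    left: str
    right: str
    modifier: str  # given via USING, or derived from the context

    def evaluate(self) -> bool:
        # At execution time the modifier is simply passed along.
        return self.op(self.left, self.right, self.modifier)
```

The same operator then gives different answers depending only on the
modifier: OpExpr(text_lt, "B", "a", "C") is true in code-point order,
while the same operands with "ci" are not.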

When an index is created, if the operators in the operator class take an
operator modifier, the modifier is stored at creation time in a new column in
pg_index (needs to be a vector or array to handle multi-column indexes).
The planner needs to check the modifier when it determines whether an
index can be used or not.
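
The planner-side check could look roughly like the following. This is a
sketch under heavy simplification: IndexInfo and index_usable are
invented names, the per-column modifier array stands in for the proposed
pg_index column, and everything else the planner considers (operator
class, column position, predicates) is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexInfo:
    name: str
    columns: Tuple[str, ...]
    modifiers: Tuple[str, ...]  # one modifier per indexed column


def index_usable(index: IndexInfo, column: str, modifier: str) -> bool:
    """True if the index covers the column with the same modifier."""
    for col, mod in zip(index.columns, index.modifiers):
        if col == column:
            return mod == modifier
    return False


# Hypothetical two-column index with a different modifier per column.
idx = IndexInfo("people_name_idx", ("last", "first"), ("fi_FI", "C"))
```

Here a query comparing "last" under "fi_FI" could use the index, but the
same comparison under "sv_SE" could not, because the index order was
built with a different modifier.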

BTW, this reminds me of the discussions we had about the tsearch default
configuration. It's different, though, because in full text search,
there's a separate tsvector data type, and the problem was with
expression indexes, not regular ones.

Another consideration is LC_CTYPE. Just as we want to support
different collations, we should support different character
classifications for upper()/lower(). We might want to tie it into
collation, since using a different ctype and collation usually doesn't
make sense, but it's something to keep in mind.
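
The classic case is Turkish, where "i" uppercases to a dotted capital
and "I" lowercases to a dotless "ı". A small sketch of upper()/lower()
taking the character classification as an extra argument follows; the
ctype parameter and the "tr_TR" special-casing are assumptions made for
illustration, not a description of how the backend does or would do it.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def upper(s: str, ctype: str) -> str:
    """upper() with the character classification passed in."""
    if ctype == "tr_TR":
        # Turkish: i -> İ (Python's default already maps ı -> I).
        s = s.replace("i", DOTTED_CAPITAL_I)
    return s.upper()


def lower(s: str, ctype: str) -> str:
    """lower() with the character classification passed in."""
    if ctype == "tr_TR":
        # Turkish: I -> ı and İ -> i.
        s = s.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return s.lower()
```

The same input gives different results per ctype: upper("istanbul", "C")
yields "ISTANBUL", whereas upper("istanbul", "tr_TR") yields "İSTANBUL".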

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
