Re: PGDay.it collation discussion notes

From: "Dave Gudeman" <dave(dot)gudeman(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, Postgres <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PGDay.it collation discussion notes
Date: 2008-10-22 17:43:06
Message-ID: 7b079fba0810221043o4d205782p883d8a8df84f54f9@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 20, 2008 at 2:28 AM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Tom Lane wrote:
>
>> Another objection to this design is that it's completely unclear that
>> functions from text to text should necessarily yield the same collation
>> that went into them, but if you treat collation as a hard-wired part of
>> the expression syntax tree you aren't going to be able to do anything
>> else.
>> (What will you do about functions/operators taking more than one text
>> argument?)
>>
>
> Whatever the spec says. Collation is intimately associated with the
> comparison operations, and doesn't make any sense anywhere else.

Of course the comparison operator is involved in many areas such as index
creation, ORDER BY, GROUP BY, etc. In order to support GROUP BY and hash
joins on values with a collation type, you need to have a hash function
corresponding to the collation.
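
To make the hashing point concrete, here is a minimal Python sketch. It assumes a hypothetical case-insensitive collation modelled with str.casefold(); the names collation_key, collation_hash and group_by are invented for illustration and are not PostgreSQL APIs. The only point is that values the collation treats as equal must land in the same hash bucket, otherwise GROUP BY or a hash join would split them.

```python
from collections import defaultdict


def collation_key(value: str, collation: str) -> str:
    """Return a key that is equal for two strings exactly when the
    (hypothetical) collation considers the strings equal."""
    if collation == "case_insensitive":  # assumed collation name
        return value.casefold()
    if collation == "binary":            # assumed collation name
        return value
    raise ValueError(f"unknown collation: {collation}")


def collation_hash(value: str, collation: str) -> int:
    # Hash the collation key, not the raw string, so that values
    # that are equal under the collation hash identically.
    return hash(collation_key(value, collation))


def group_by(values, collation):
    """Group values the way a hash-based GROUP BY would under a collation."""
    groups = defaultdict(list)
    for v in values:
        groups[collation_key(v, collation)].append(v)
    return list(groups.values())
```

With the same input, group_by(["a", "A", "b"], "case_insensitive") yields two groups while the "binary" collation yields three, which is why the hash function has to be chosen per collation.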

> The way the default collation for a given operation is determined, by
> bubbling up the collation from the operands, through function calls and
> other expressions, is just to make life a bit easier for the developer who's
> writing the SQL. We could demand that you always explicitly specify a
> collation when you use the text equality or inequality operators, but
> because that would be quite tiresome, a reasonable default is derived from
> the context.

In this sense, collation is no different from any other feature of the
value's type. You could equally require explicit type annotations
everywhere, but type derivation spares the developer that as well.

> Looking at an individual value, collation just doesn't make sense.
> Collation is a property of the comparison operation, not of a value.
>

Collation can't be a property of the comparison operation because you don't
know what comparison to use until you know the collation type of the value.
Collation is a property of string values, just like scale and precision are
properties of numeric values. And like those properties of numeric values,
collation can be statically determined. The rules for determining what
collation to use in an expression are similar in kind to the rules for
determining what the resulting scale and precision of an arithmetic
expression are. If you consider collation as just part of the type, a lot of
things are easier.
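
As a rough illustration of "similar in kind", the Python sketch below derives a result type statically in both cases. Everything in it is an assumption made for the example: the DecimalType and TextType classes, the addition rule for precision and scale (one common convention, not necessarily what any given database does), and the collation rule (a simplified version of "explicit wins, otherwise the two must agree", not the full SQL spec rules).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    precision: int
    scale: int


@dataclass(frozen=True)
class TextType:
    collation: str
    explicit: bool = False  # True if set by an explicit collation clause


def add_result_type(a: DecimalType, b: DecimalType) -> DecimalType:
    """Derive the type of a + b from the operand types alone.
    Assumed rule: keep the larger scale and the larger number of
    integer digits, plus one digit for a possible carry."""
    scale = max(a.scale, b.scale)
    int_digits = max(a.precision - a.scale, b.precision - b.scale)
    return DecimalType(int_digits + scale + 1, scale)


def comparison_collation(a: TextType, b: TextType) -> str:
    """Derive the collation used to compare a and b from their types alone.
    Assumed, simplified rule: identical collations agree, an explicit
    collation beats a derived one, anything else is a conflict."""
    if a.collation == b.collation:
        return a.collation
    if a.explicit and not b.explicit:
        return a.collation
    if b.explicit and not a.explicit:
        return b.collation
    raise TypeError(f"collation conflict: {a.collation} vs {b.collation}")
```

Neither function looks at a value; both work purely on type information, which is the sense in which collation behaves like scale and precision.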

>
> In the parser, we might have to do something like that though, because
> according to the standard you can tack the COLLATION keyword to string
> constants and have it bubble up. But let's keep that ugliness just inside
> the parser.

The COLLATION expression is no different in kind from a type cast. It just
works on a restricted part of the type.

> One, impractical, way to implement collation would be to have one operator
> class per collation. In fact you could do that today, with no backend
> changes, to support multiple collations. It's totally impractical, because
> for starters you'd need different comparison operators, with different
> names, for each collation. But it's the right mental model.

You can use that model, but it is simpler to view it as an overloaded
function. You don't conceptually imagine that DECIMAL(10,4) and
DECIMAL(20,2) have different comparison operations, so why would you
regard two strings with different collations as having different
comparison operations?

> I think the right approach is to invent a new concept called "operator
> modifier". It's basically a 3rd argument to operators. It can be specified
> explicitly when an operator is used, with syntax like "<left> Op <right>
> USING <modifier>", or in case of collation, it's derived from the context,
> per SQL spec. The operator modifier is tacked on to OpExprs and SortClauses
> in the parser, and passed as a 3rd argument to the function implementing the
> operator at execution time.

This is a good way to implement collated comparisons, but it's not a new
concept, just an additional argument to the comparison operator. It isn't
necessary to create new concepts to handle collation when it fits so well
into an existing concept, the type. For example, the difference between two
indexes with different collations is a difference in the type of the index,
just like the difference between a DECIMAL(10,4) index and a DECIMAL(20,2)
index.
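
A minimal Python sketch of the "additional argument" view, under stated assumptions: text_cmp and order_by are hypothetical names, and the two collations ("binary" and a casefold-based "case_insensitive") are stand-ins for real ones. The comparison function is a single overloaded routine; the collation, taken from the operand type, is simply passed in as a third argument.

```python
from functools import cmp_to_key


def text_cmp(left: str, right: str, collation: str) -> int:
    """Three-way comparison with the collation as an extra argument.
    Returns -1, 0 or 1."""
    if collation == "case_insensitive":  # assumed collation name
        l, r = left.casefold(), right.casefold()
    elif collation == "binary":          # assumed collation name
        l, r = left, right
    else:
        raise ValueError(f"unknown collation: {collation}")
    return (l > r) - (l < r)


def order_by(values, collation):
    """Sort values using the comparison selected by the collation."""
    return sorted(values, key=cmp_to_key(lambda a, b: text_cmp(a, b, collation)))
```

The caller never picks a differently named operator per collation; the same text_cmp serves every collation, much as one comparison serves both DECIMAL(10,4) and DECIMAL(20,2).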

When I added collation to a commercial RDBMS it made things a lot easier to
just fold the collation into the type system. After all, the type defines
the operators that act on it and collation is just a specialization of this
notion. Incidentally, collation can be easily extended to non-string types;
it is just the section of the type information that controls how the values
are compared (and hashed). This could be very useful for datetime values and
user-defined types as well as strings.
