Re: Per-column collation, the finale

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Per-column collation, the finale
Date: 2011-02-02 21:20:44
Message-ID: 1296681644.7277.33.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> I'm reviewing this patch as part of CommitFest 2011-01.

Thank you very much for this thorough review.

> This no longer applies cleanly against git master, so I've done my testing
> against 52713d0.

I will send an updated patch soon, when I have figured out the issues
discussed below.

> If you go about supporting non-GNU libc systems, you may want to look at the
> gnulib locale module[2] for a starting point.

I don't have a Mac system, but I'll see about figuring this out.

> The new documentation is helpful. It would be useful to document the
> implications of marking your user-defined type COLLATABLE. As best I can tell,
> you should only set COLLATABLE if functions that would be expected to respect
> collation, particularly members of a B-tree operator family, do check the fmgr
> collation. Otherwise, the user might specify collations for data of your type
> and be surprised to get no differentiated behavior. Are there other
> considerations? Also, should implementers of COLLATABLE types observe any
> restrictions on when to respect the collation? For example, is equality
> permitted to be sensitive to collation?

I improved the documentation in the CREATE TYPE man page about this.
Note, however, that the questions you raise are not new. There already
is a set of types that observe locale information in a set of associated
functions. The answers are already hidden somewhere; this feature just
exposes them in a different way. Over time, we may want to document
more of this, but at the moment it's not clear how much to actually
expose.

> On that note, the following is accepted; should it be?
>
> CREATE TABLE parent (key text COLLATE "sv_SE");
> CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
> CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
>
> Does the system expect any invariants to hold on the platform implementation of
> collations? For example, that they be consistent with equality in some way? If
> so, would it be practical and/or helpful to provide a tool for testing a given
> platform locale for conformance?

We discussed this previously in this thread. Currently, collation
support does not affect equality. Values are only equal if they are
bytewise equal. That's why the above is currently OK, but it should
probably be restricted in the future. I'll make a note about it.

> pg_attribute entries for an index currently retain the attcollation of the table
> columns. They should have the collation of the index columns. At that point,
> pg_index.indcollation would exist for optimization only.

Fixed.

> Though there's no reason it ought to happen with the first commit, it would be
> really excellent for "make check" to detect when the platform has sufficient
> support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
> locales). Manual execution is fine for something like numeric_big, but this is
> so platform-sensitive that testing it widely is important.

I solicited ideas about this a while ago, but no one came up with a
solution. If we do get Mac or Windows support, it would be a good time
to revisit this and devise a platform-independent approach, if possible.

> It would likewise be nice to have a way to display the collation of an arbitrary
> expression. A merger of pg_typeof and format_type would cover that.

Agreed. I have a follow-up patch pending for that.

> Four call sites gain code like this (example is from varstr_cmp):
>
> +#ifdef HAVE_LOCALE_T
> + locale_t mylocale = pg_newlocale_from_collation(collid);
> +#else
> + check_default_collation(collid);
> +#endif
> ...
> +#ifdef HAVE_LOCALE_T
> + result = strcoll_l(a1p, a2p, mylocale);
> +#else
> result = strcoll(a1p, a2p);
> +#endif
>
> Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
> that merely calls check_default_collation, "#define strcoll_l(a, b, c)
> strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the
> abstraction would make the mainline code significantly cleaner. Even so, it
> might not pay off. Thoughts?

I had considered that, but I'm mainly hesitant about introducing a fake
locale_t type into the public namespace. Maybe it should be wrapped
into a pg_locale_t; then we can do whatever we want.

> Couldn't check_default_collation be a no-op except on assert-only builds? It's
> currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
> have added non-default collations. CREATE COLLATION will presumably be made to
> fail unconditionally on such builds, too.

Unless you register the HAVE_LOCALE_T state in pg_control, you need to
be prepared to deal with schemas that contain nondefault collations even
though the server binary doesn't support it.

> wchar2char and char2wchar take a collation argument, but they only use it to
> "Assert(!lc_ctype_is_c(collation))". Maybe that's best, but it seems weird.

Yeah, it's weird, but that is the existing interface.

> The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
> The documentation also says "An index can only support one collation.", but it's
> one collation per column.

Fixed, leftover from previous code.

> This surprised me; not sure how much can/should be done:
>
> [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID";
> ERROR: collations are not supported by type integer
> [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> -- worked ...

Yeah, this could be fixed. I'll make a note about this. Note that this
is actually SQL92 syntax, removed in SQL99, so it's not essential.

> This too:
>
> [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
> -- worked ...
> [local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> -- worked ...
> [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
> ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
> LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...

This is correct, I think. At least if you want to keep open the
possibility that a collation could also determine equality.

> You implement the standard's requirement to forbid "CAST(foo AS text COLLATE
> "id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead. You also
> reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"".
> Is that desirable, needed to avoid syntactic ambiguity, or just an unintended
> consequence? In the absence of other considerations, it seems undesirable to
> require the parentheses.

This is just how the parser interprets it. The :: syntax is full of odd
ambiguities like that.

> UNION on disparate collations seems to error or strip them, depending on how
> it's written:
>
> CREATE TABLE t_sv (c text COLLATE "sv_SE");
> CREATE TABLE t_tr (c text COLLATE "tr_TR");
> CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
> -- t_both.c has default collation
> CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
> -- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"

This is correct per SQL standard. In the second case, the explicit
COLLATE clauses cause a conflict, so you get an error. In the first
case, you have implicit collation derivations that conflict, so the
result has actually no collation (not default collation), which means
you cannot sort the resulting value at all. It's actually a bit strange
here that we allow the creation of a table based on a query that has a
no collation derivation, but I did not see anything in the SQL standard
that prohibits it. Maybe a notice would be useful, like when you create
a column of type "unknown".

> COLLATE is getting recognized in various places where it does not actually do
> anything. Probably a few more places need SimpleTypenameWithoutCollation:
>
> CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
> ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
> CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
> ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
> CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
> CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';

All of these places also accept useless typmods. The parser is
currently pretty loose about that. It would need some major
restructuring to fix that.

> PREPARE foo(text collate "id_ID") AS SELECT $1; -- maybe this one does something? didn't verify

It doesn't, but it probably could. I'll check it out.

> The 13% slowdown with the feature unused seems worrisome, but the additional 16%
> slowdown when using the feature to adopt a different locale seems fine.

I figured out this problem. It needs some small restructuring in
varstr_cmp to fix. It kind of relates to how the
pg_newlocale_from_collation/check_default_collation restructuring is
done. But it's definitely fixable.

Well, my numbers were actually smaller than 13%, but it was noticeable.
I'll give you something new to test later.

> Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
> to have a different concept of collation naming. This patch creates a "default"
> collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The
> standard has specific names for specific default collations based on the
> character repertoire (our server_encoding, I suppose). So, the most-conformant
> thing would be to set the name of the pg_collation of oid = 100 at CREATE
> DATABASE time, based on the chosen ENCODING. Note sure how valuable this is.

The problem is, you can't reach across databases when you create a new
one. This is really the root of all evil with this default collation
business.

> The standard requires that collations be GRANTable objects, but that seems
> fairly useless in practice.

Yeah, omitted for now.

> Section 11.4 puts the COLLATE clause as the last part of the column definition.
> This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work,
> but it does not. It appears CREATE TABLE gets the COLLATE clause via
> SimpleTypname, but to support the standard syntax, it would need to arrive via
> ColConstraintElem or some such. I have not looked at the problems this might
> bring. A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE =
> 'foo') COLLATE "id_ID"" (11.24).

This is only under the optional feature F692 "Extended collation
support", which I chose to not bother with, because it doesn't provide
any additional functionality. (We could consider it if it were required
for compatibility with some other implementation.)

> Also, 11.4 syntax rule 9 seems to forbid the
> following, which the patch accepts. Seems just as well to part from the spec in
> this regard, though:
>
> CREATE DOMAIN dom AS text;
> CREATE TABLE t (c dom COLLATE "id_ID");

Good catch. Doesn't seem useful to restrict this, though.

This is good stuff. I'll send you a new patch in a day or three for
perhaps another round of performance tests. Some of the other issues
above can perhaps be postponed for follow-up patches.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pitts 2011-02-02 21:25:00 Re: [HACKERS] Slow count(*) again...
Previous Message Kenneth Marshall 2011-02-02 21:14:06 Re: [HACKERS] Slow count(*) again...