Re: Per-column collation, the finale

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Per-column collation, the finale
Date: 2011-01-29 07:52:53
Message-ID: 20110129075253.GA18784@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter,

I'm reviewing this patch as part of CommitFest 2011-01.

On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote:
> I've been going over this patch with a fine-tooth comb for the last two
> weeks, fixed some small bugs, added comments, made initdb a little
> friendlier, but no substantial changes.
>
> I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
> support and associated things now.

This no longer applies cleanly against git master, so I've done my testing
against 52713d0. I tested on Ubtunu 8.04 (GNU libc 2.7) and used the
configure.in change I sent earlier[1] to get locale support detected. Since the
time to seriously verify every patch hunk would be measured in weeks, I have not
attempted that level of review detail. I've just commented on what stood out
while reading and using the patch.

If you go about supporting non-GNU libc systems, you may want to look at the
gnulib locale module[2] for a starting point.

The new documentation is helpful. It would be useful to document the
implications of marking your user-defined type COLLATABLE. As best I can tell,
you should only set COLLATABLE if functions that would be expected to respect
collation, particularly members of a B-tree operator family, do check the fmgr
collation. Otherwise, the user might specify collations for data of your type
and be surprised to get no differentiated behavior. Are there other
considerations? Also, should implementers of COLLATABLE types observe any
restrictions on when to respect the collation? For example, is equality
permitted to be sensitive to collation? On that note, the following is
accepted; should it be?

CREATE TABLE parent (key text COLLATE "sv_SE");
CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));

Does the system expect any invariants to hold on the platform implementation of
collations? For example, that they be consistent with equality in some way? If
so, would it be practical and/or helpful to provide a tool for testing a given
platform locale for conformance?

pg_attribute entries for an index currently retain the attcollation of the table
columns. They should have the collation of the index columns. At that point,
pg_index.indcollation would exist for optimization only.

Though there's no reason it ought to happen with the first commit, it would be
really excellent for "make check" to detect when the platform has sufficient
support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
locales). Manual execution is fine for something like numeric_big, but this is
so platform-sensitive that testing it widely is important.

It would likewise be nice to have a way to display the collation of an arbitrary
expression. A merger of pg_typeof and format_type would cover that.

Four call sites gain code like this (example is from varstr_cmp):

+#ifdef HAVE_LOCALE_T
+ locale_t mylocale = pg_newlocale_from_collation(collid);
+#else
+ check_default_collation(collid);
+#endif
...
+#ifdef HAVE_LOCALE_T
+ result = strcoll_l(a1p, a2p, mylocale);
+#else
result = strcoll(a1p, a2p);
+#endif

Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
that merely calls check_default_collation, "#define strcoll_l(a, b, c)
strcoll(a, b)", etc. I can see six TODO sites with similar needs, and the
abstraction would make the mainline code significantly cleaner. Even so, it
might not pay off. Thoughts?

Couldn't check_default_collation be a no-op except on assert-only builds? It's
currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
have added non-default collations. CREATE COLLATION will presumably be made to
fail unconditionally on such builds, too.

wchar2char and char2wchar take a collation argument, but they only use it to
"Assert(!lc_ctype_is_c(collation))". Maybe that's best, but it seems weird.

The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
The documentation also says "An index can only support one collation.", but it's
one collation per column.

This surprised me; not sure how much can/should be done:

[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID";
ERROR: collations are not supported by type integer
[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
-- worked ...

This too:

[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;
-- worked ...
[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
-- worked ...
[local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
ERROR: column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE...

You implement the standard's requirement to forbid "CAST(foo AS text COLLATE
"id_ID")"; one writes "CAST(foo AS text) COLLATE "id_ID"" instead. You also
reject "foo::text COLLATE "id_ID""; one can write "(foo::text) COLLATE "id_ID"".
Is that desirable, needed to avoid syntactic ambiguity, or just an unintended
consequence? In the absence of other considerations, it seems undesirable to
require the parentheses.

UNION on disparate collations seems to error or strip them, depending on how
it's written:

CREATE TABLE t_sv (c text COLLATE "sv_SE");
CREATE TABLE t_tr (c text COLLATE "tr_TR");
CREATE TABLE t_both AS SELECT c FROM t_sv UNION ALL SELECT c FROM t_tr;
-- t_both.c has default collation
CREATE TABLE t_otherboth AS SELECT ('foo'::text) COLLATE "sv_SE" UNION ALL SELECT ('bar'::text) COLLATE "tr_TR";
-- ERROR: 42P21: collation mismatch between explicit collations "sv_SE" and "tr_TR"

COLLATE is getting recognized in various places where it does not actually do
anything. Probably a few more places need SimpleTypenameWithoutCollation:

CREATE OPERATOR =-= (PROCEDURE = texteq, LEFTARG = text COLLATE "id_ID", RIGHTARG = text);
ALTER OPERATOR || (text COLLATE "id_ID", text) OWNER TO nm;
CREATE CAST (text COLLATE "id_ID" AS mytext) WITHOUT FUNCTION;
ALTER AGGREGATE min(text collate "id_ID") OWNER TO nm;
CREATE FUNCTION f(varchar collate "id_ID") RETURNS INT LANGUAGE SQL AS 'SELECT 1';
CREATE FUNCTION f() RETURNS TABLE (c text COLLATE "id_ID") LANGUAGE SQL AS 'SELECT ''foo''::text';
PREPARE foo(text collate "id_ID") AS SELECT $1; -- maybe this one does something? didn't verify

I ran this benchmark:

-- Setup: table of 10M random Unicode characters
SELECT setseed(0);
CREATE TABLE t (c) AS
SELECT chr(1 + (random() * 65534)::int) FROM generate_series(1,10000000) gen(n);

-- en_US.UTF8, unpatched: 65.73s
-- en_US.UTF8, patched: 74.43s
-- id_ID.UTF8, unpatched: 65.77s
-- id_ID.UTF8, patched: 74.53s
SELECT min(c)

-- en_US.UTF8, patched: 86.46s
-- id_ID.UTF8, patched: 83.70s
--SELECT min(c COLLATE "id_ID")

FROM (
SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
) t_all;

The 13% slowdown with the feature unused seems worrisome, but the additional 16%
slowdown when using the feature to adopt a different locale seems fine. Here is
an opreport (entries >1%) from master (actually 52713d0) running the first test
SELECT under en_US.UTF8:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
59288 12.7660 ExecProcNode
44183 9.5136 advance_transition_function
43211 9.3043 slot_deform_tuple
36019 7.7557 ExecProject
33495 7.2122 heapgettup_pagemode
26394 5.6832 varstr_cmp
26316 5.6664 advance_aggregates
23895 5.1451 ExecAgg
14472 3.1161 ExecStoreTuple
12939 2.7860 ExecScan
12750 2.7454 HeapTupleSatisfiesMVCC
12250 2.6377 heapgetpage
11695 2.5182 text_smaller
11559 2.4889 slot_getsomeattrs
10774 2.3199 ExecClearTuple
9092 1.9577 text_cmp
7724 1.6631 SeqNext
7354 1.5835 heap_getnext
7257 1.5626 ExecAppend
6647 1.4312 pg_detoast_datum_packed
6505 1.4007 AllocSetReset
5692 1.2256 XidInMVCCSnapshot

Patched, same test case:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
58488 12.9975 ExecProcNode
45430 10.0957 advance_transition_function
42843 9.5208 slot_deform_tuple
35877 7.9728 ExecProject
32926 7.3170 heapgettup_pagemode
27181 6.0403 varstr_cmp
26342 5.8539 advance_aggregates
22243 4.9430 ExecAgg
14452 3.2116 ExecStoreTuple
13171 2.9269 ExecScan
11708 2.6018 text_smaller
11417 2.5371 slot_getsomeattrs
10758 2.3907 heapgetpage
10710 2.3800 ExecClearTuple
10035 2.2300 HeapTupleSatisfiesMVCC
8998 1.9996 text_cmp
7721 1.7158 SeqNext
7329 1.6287 heap_getnext
7324 1.6276 ExecAppend
6534 1.4520 AllocSetReset
6481 1.4402 pg_detoast_datum_packed
5862 1.3027 XidInMVCCSnapshot
4608 1.0240 MemoryContextReset

Nothing really pops out there.

pg_dump successfully preserves collation for table columns, index columns, and
domains.

Concerning standards conformance, ISO 9075-2:2008 sections 4.2.4 and 4.2.6 seem
to have a different concept of collation naming. This patch creates a "default"
collation that is implicitly the LC_COLLATE/LC_CTYPE of the database. The
standard has specific names for specific default collations based on the
character repertoire (our server_encoding, I suppose). So, the most-conformant
thing would be to set the name of the pg_collation of oid = 100 at CREATE
DATABASE time, based on the chosen ENCODING. Note sure how valuable this is.

The standard requires that collations be GRANTable objects, but that seems
fairly useless in practice.

Section 11.4 puts the COLLATE clause as the last part of the column definition.
This would require "CREATE TABLE t (foo text NOT NULL COLLATE "id_ID")" to work,
but it does not. It appears CREATE TABLE gets the COLLATE clause via
SimpleTypname, but to support the standard syntax, it would need to arrive via
ColConstraintElem or some such. I have not looked at the problems this might
bring. A similar problem arises with "CREATE DOMAIN dom AS text CHECK (VALUE =
'foo') COLLATE "id_ID"" (11.24). Also, 11.4 syntax rule 9 seems to forbid the
following, which the patch accepts. Seems just as well to part from the spec in
this regard, though:

CREATE DOMAIN dom AS text;
CREATE TABLE t (c dom COLLATE "id_ID");

Overall, this patch implements a useful feature, and I could not find any deep
unsoundness in the implementation. There's obviously plenty left to do
(multi-platform support, DDL for collations, various TODOs in the patch), but
one need not expect to cover all that in one patch. I'd be most concerned about
the performance regression when not using the feature.

I'm sure I've missed plenty, and even this review probably contains errors.

Thanks,
nm

[1] http://archives.postgresql.org/pgsql-hackers/2011-01/msg02443.php
[2] http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob_plain;f=m4/locale_h.m4

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2011-01-29 08:26:07 Re: ALTER TYPE 3: add facility to identify further no-work cases
Previous Message Fujii Masao 2011-01-29 07:10:19 Re: Include WAL in base backup