Re: Collation versioning

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2019-12-12 04:00:58
Message-ID: CA+hUKGK8CwBcTcXWL2kUjpHT+6t2hEFCzkcZ-Z7xXbz=C4NLCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 12, 2019 at 2:45 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> Hearing no objection in [1], attached v5 with following changes:
>
> - fix the spurious warnings by doing the version check in
> get_relation_info and vacuum_open_relation, and add a flag in
> RelationData to mark the check as already being done
> - change the IsCatalogRelation() check to IsSystemRelation() to also
> ignore toast indexes, as those can also be safely ignored too
> - add a new ALTER INDEX idxname DEPENDS ON COLLATION collname CURRENT
> VERSION to let users remove the warnings for safe library upgrade.
> Documentation and tab completion added
>
> If I'm not mistaken, all issues I was aware of are now fixed.

Thanks! This is some great progress and I'm feeling positive about
getting this into PostgreSQL 13. I haven't (re)reviewed the code yet,
but I played with it a bit and have some more feedback.

There are some missing semi-colons on the ALTER INDEX statements in
pg_dump.c that make the pg_upgrade test fail (at least, if LC_ALL is
set).

We create duplicate pg_depend records:

postgres=# create table t (x text);
CREATE TABLE
postgres=# create index on t(x) where x > 'hello';
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
1259 | 16424 | 0 | 3456 | 100 | 0 |
0:34.0 | n
1259 | 16424 | 0 | 3456 | 100 | 0 |
0:34.0 | n
(2 rows)

I wondered if that was harmless, but for one thing it causes duplicate warnings:

postgres=# update pg_depend set refobjversion = 'BANANA' where
refobjversion = '0:34.0';
UPDATE 2

[new session]
postgres=# select count(*) from t;
WARNING: index "t_x_idx" depends on collation "default" version
"BANANA", but the current version is "0:34.0"
DETAIL: The index may be corrupted due to changes in sort order.
HINT: REINDEX to avoid the risk of corruption.
WARNING: index "t_x_idx" depends on collation "default" version
"BANANA", but the current version is "0:34.0"
DETAIL: The index may be corrupted due to changes in sort order.
HINT: REINDEX to avoid the risk of corruption.

Here's another way to get a duplicate, and in this example you also
get an unnecessary dependency on 100 "default" for this index:

postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
collate "fr_FR";
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
1259 | 16460 | 0 | 3456 | 12603 | 0 |
0:34.0 | n
1259 | 16460 | 0 | 3456 | 12603 | 0 |
0:34.0 | n
1259 | 16460 | 0 | 3456 | 100 | 0 |
0:34.0 | n
(3 rows)

Or... maybe 100 should be there, by simple analysis of the x in the
WHERE clause, but it's the same if you write x collate "fr_FR" >
'helicopter' collate "fr_FR", and in that case there are no
expressions of collation "default" anywhere.

The indirection through composite types works nicely:

postgres=# create type foo_t as (en text collate "en_CA", fr text
collate "fr_CA");
CREATE TYPE
postgres=# create table t (foo foo_t);
CREATE TABLE
postgres=# create index on t(foo);
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
1259 | 16444 | 0 | 3456 | 12554 | 0 |
0:34.0 | n
1259 | 16444 | 0 | 3456 | 12597 | 0 |
0:34.0 | n
(2 rows)

... but again it shows the extra and technically unnecessary
dependencies (only 12603 "fr_FR" is really needed):

postgres=# create index on t(((foo).fr collate "fr_FR"));
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
1259 | 16445 | 0 | 3456 | 12603 | 0 |
0:34.0 | n
1259 | 16445 | 0 | 3456 | 12597 | 0 |
0:34.0 | n
1259 | 16445 | 0 | 3456 | 12554 | 0 |
0:34.0 | n
(3 rows)

I check that nested types are examined recursively, as appropriate. I
also tested domains, arrays, arrays of domains, expressions extracting
an element from an array of a domain with an explicit collation, and
the only problem I could find was more ways to get duplicates. Hmm...
what else is there that can contain a collatable type...? Ranges!

postgres=# create type myrange as range (subtype = text);
CREATE TYPE
postgres=# drop table t;
DROP TABLE
postgres=# create table t (x myrange);
CREATE TABLE
postgres=# create index on t(x);
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
(0 rows)

... or perhaps, more realistically, a GIST index might actually be
useful for range queries, and we're not capturing the dependency:

postgres=# create index t_x_idx on t using gist (x);
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
---------+-------+----------+------------+----------+-------------+---------------+---------
(0 rows)

The new syntax "ALTER INDEX i_name DEPENDS ON ANY COLLATION UNKNOWN
VERSION" doesn't sound good to me, it's not "ANY" collation, it's a
specific set of collations that we aren't listing. "ALTER INDEX
i_name DEPENDS ON COLLATION * VERSION UNKNOWN", hrmph, no that's
terrible... I'm not sure what would be better.

I'm not sure if I like the idea of VACUUM reporting warnings or not. Hmm.

To state more explicitly what's happening here, we're searching the
expression trees for subexpresions that have a collation as part of
their static type. We don't know which functions or operators are
actually affected by the collation, though. For example, if an
expression says "x IS NOT NULL" and x happens to be a subexpression of
a type with a particular collation, we don't now that this
expression's value can't possibly be affected by the collation version
changing. So, the system will nag you to rebuild an index just
because you mentioned it, even though the index can't be corrupted.
To do better than that, I suppose we'd need declarations in the
catalog to say which functions/operators are collation sensitive.
Then, as a special case, there is the collation of the actual indexed
value, because that will implicitly be used as input to the btree ops
that would be collation sensitive. That's just a thought experiment:
it seems like massive overkill to try to catalog collation sensitivity
for a rather limited benefit, and I'm happy with the way you have it.

More soon.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2019-12-12 04:07:48 Let people set host(no)ssl settings from initdb
Previous Message Kyotaro Horiguchi 2019-12-12 03:48:56 Re: BUG #16159: recovery requests WALs for the next timelines before timeline switch LSN has been reached