Re: Collation versioning

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2019-12-12 14:36:15
Message-ID: CAOBaU_Zg==075nNEhd+Du0x32q0ojHC1DJ86mBCW5JYKaosb5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Thomas,

Thanks for looking at it!

On Thu, Dec 12, 2019 at 5:01 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> We create duplicate pg_depend records:
>
> [...]
>
> I wondered if that was harmless

That's the assumed behavior of recordMultipleDependencies:

/*
* Record the Dependency. Note we don't bother to check for
* duplicate dependencies; there's no harm in them.
*/

We could add a check to skip duplicates for the "track_version ==
true" path, or switch to flags if we want to also skip duplicates in
other cases, but it'll make recordMultipleDependencies a little bit
more specialised.

> but for one thing it causes duplicate warnings:

Yes, that should be avoided.

> Here's another way to get a duplicate, and in this example you also
> get an unnecessary dependency on 100 "default" for this index:
>
> postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
> collate "fr_FR";
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
> classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> 1259 | 16460 | 0 | 3456 | 12603 | 0 |
> 0:34.0 | n
> 1259 | 16460 | 0 | 3456 | 12603 | 0 |
> 0:34.0 | n
> 1259 | 16460 | 0 | 3456 | 100 | 0 |
> 0:34.0 | n
> (3 rows)
>
> Or... maybe 100 should be there, by simple analysis of the x in the
> WHERE clause, but it's the same if you write x collate "fr_FR" >
> 'helicopter' collate "fr_FR", and in that case there are no
> expressions of collation "default" anywhere.

Ah good point. That's because expression_tree_walker() will dig into
CollateExpr->args and eventually reach the underlying Var. I don't
see an easy way to avoid that while still properly recording the
required dependency for an even more realistic index such as

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" >
'helicopter' COLLATE "en_US")::text) collate "fr_FR";

and for instance not for

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" ||
'helicopter' COLLATE "en_US")) collate "fr_FR";

> The indirection through composite types works nicely:
>
> postgres=# create type foo_t as (en text collate "en_CA", fr text
> collate "fr_CA");
> CREATE TYPE
> postgres=# create table t (foo foo_t);
> CREATE TABLE
> postgres=# create index on t(foo);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
> and refobjversion != '';
> classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> 1259 | 16444 | 0 | 3456 | 12554 | 0 |
> 0:34.0 | n
> 1259 | 16444 | 0 | 3456 | 12597 | 0 |
> 0:34.0 | n
> (2 rows)
>
> ... but again it shows the extra and technically unnecessary
> dependencies (only 12603 "fr_FR" is really needed):
>
> postgres=# create index on t(((foo).fr collate "fr_FR"));
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
> and refobjversion != '';
> classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> 1259 | 16445 | 0 | 3456 | 12603 | 0 |
> 0:34.0 | n
> 1259 | 16445 | 0 | 3456 | 12597 | 0 |
> 0:34.0 | n
> 1259 | 16445 | 0 | 3456 | 12554 | 0 |
> 0:34.0 | n
> (3 rows)

Yes :(

> I check that nested types are examined recursively, as appropriate. I
> also tested domains, arrays, arrays of domains, expressions extracting
> an element from an array of a domain with an explicit collation, and
> the only problem I could find was more ways to get duplicates. Hmm...
> what else is there that can contain a collatable type...? Ranges!
>
> postgres=# create type myrange as range (subtype = text);
> CREATE TYPE
> postgres=# drop table t;
> DROP TABLE
> postgres=# create table t (x myrange);
> CREATE TABLE
> postgres=# create index on t(x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
> classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> (0 rows)
>
> ... or perhaps, more realistically, a GIST index might actually be
> useful for range queries, and we're not capturing the dependency:
>
> postgres=# create index t_x_idx on t using gist (x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
> classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> (0 rows)

Good catch :) I fixed it locally and checked that a gist index on a
range with a subtype being a composite type does record the required
dependencies.

> The new syntax "ALTER INDEX i_name DEPENDS ON ANY COLLATION UNKNOWN
> VERSION" doesn't sound good to me, it's not "ANY" collation, it's a
> specific set of collations that we aren't listing. "ALTER INDEX
> i_name DEPENDS ON COLLATION * VERSION UNKNOWN", hrmph, no that's
> terrible... I'm not sure what would be better.

Mmm, indeed. With a 3rd round in the existing keyword, how about
"DEPENDS ON [ ANY ] REFERENCING COLLATION"? The ANY is mostly to
avoid the need for plural.

> I'm not sure if I like the idea of VACUUM reporting warnings or not. Hmm.

Even if I add this in a IsAutoVacuumWorkerProcess?

> To state more explicitly what's happening here, we're searching the
> expression trees for subexpresions that have a collation as part of
> their static type. We don't know which functions or operators are
> actually affected by the collation, though. For example, if an
> expression says "x IS NOT NULL" and x happens to be a subexpression of
> a type with a particular collation, we don't now that this
> expression's value can't possibly be affected by the collation version
> changing. So, the system will nag you to rebuild an index just
> because you mentioned it, even though the index can't be corrupted.
> To do better than that, I suppose we'd need declarations in the
> catalog to say which functions/operators are collation sensitive.

Wouldn't that still be a problem for an absurd expression like

WHERE length((val collate "en_US" > 'uh' collate "en_US")::text) > 0

And since we would still have to record a dependency on the collation
in such case, we would need to have another magic value to distinguish
"unknown" from "cannot cause corruption" collation version.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-12-12 14:41:18 Re: Duplicate function call on timestamp2tm
Previous Message Li Japin 2019-12-12 14:34:20 Duplicate function call on timestamp2tm