Re: Collation versioning

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2019-12-17 13:25:14
Message-ID: CAOBaU_bP70ybzc23E=Z38mygMAOThPG7Fog1Qoom5DNvsfq=Ng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 12, 2019 at 3:36 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> Hello Thomas,
>
> Thanks for looking at it!
>
> On Thu, Dec 12, 2019 at 5:01 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > We create duplicate pg_depend records:
> >
> > [...]
> >
> > I wondered if that was harmless
>
> That's the assumed behavior of recordMultipleDependencies:
>
> /*
> * Record the Dependency. Note we don't bother to check for
> * duplicate dependencies; there's no harm in them.
> */
>
> We could add a check to skip duplicates for the "track_version ==
> true" path, or switch to flags if we want to also skip duplicates in
> other cases, but it'll make recordMultipleDependencies a little bit
> more specialised.
>
> > but for one thing it causes duplicate warnings:
>
> Yes, that should be avoided.
>
> > Here's another way to get a duplicate, and in this example you also
> > get an unnecessary dependency on 100 "default" for this index:
> >
> > postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
> > collate "fr_FR";
> > CREATE INDEX
> > postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> > and refobjversion != '';
> > classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> > refobjversion | deptype
> > ---------+-------+----------+------------+----------+-------------+---------------+---------
> > 1259 | 16460 | 0 | 3456 | 12603 | 0 |
> > 0:34.0 | n
> > 1259 | 16460 | 0 | 3456 | 12603 | 0 |
> > 0:34.0 | n
> > 1259 | 16460 | 0 | 3456 | 100 | 0 |
> > 0:34.0 | n
> > (3 rows)
> >
> > Or... maybe 100 should be there, by simple analysis of the x in the
> > WHERE clause, but it's the same if you write x collate "fr_FR" >
> > 'helicopter' collate "fr_FR", and in that case there are no
> > expressions of collation "default" anywhere.
>
> Ah good point. That's because expression_tree_walker() will dig into
> CollateExpr->args and eventually reach the underlying Var. I don't
> see an easy way to avoid that while still properly recording the
> required dependency for an even more realistic index such as
>
> CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" >
> 'helicopter' COLLATE "en_US")::text) collate "fr_FR";
>
> and for instance not for
>
> CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" ||
> 'helicopter' COLLATE "en_US")) collate "fr_FR";
>
>
> > The indirection through composite types works nicely:
> >
> > postgres=# create type foo_t as (en text collate "en_CA", fr text
> > collate "fr_CA");
> > CREATE TYPE
> > postgres=# create table t (foo foo_t);
> > CREATE TABLE
> > postgres=# create index on t(foo);
> > CREATE INDEX
> > postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
> > and refobjversion != '';
> > classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> > refobjversion | deptype
> > ---------+-------+----------+------------+----------+-------------+---------------+---------
> > 1259 | 16444 | 0 | 3456 | 12554 | 0 |
> > 0:34.0 | n
> > 1259 | 16444 | 0 | 3456 | 12597 | 0 |
> > 0:34.0 | n
> > (2 rows)
> >
> > ... but again it shows the extra and technically unnecessary
> > dependencies (only 12603 "fr_FR" is really needed):
> >
> > postgres=# create index on t(((foo).fr collate "fr_FR"));
> > CREATE INDEX
> > postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
> > and refobjversion != '';
> > classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> > refobjversion | deptype
> > ---------+-------+----------+------------+----------+-------------+---------------+---------
> > 1259 | 16445 | 0 | 3456 | 12603 | 0 |
> > 0:34.0 | n
> > 1259 | 16445 | 0 | 3456 | 12597 | 0 |
> > 0:34.0 | n
> > 1259 | 16445 | 0 | 3456 | 12554 | 0 |
> > 0:34.0 | n
> > (3 rows)
>
> Yes :(
>
> > I check that nested types are examined recursively, as appropriate. I
> > also tested domains, arrays, arrays of domains, expressions extracting
> > an element from an array of a domain with an explicit collation, and
> > the only problem I could find was more ways to get duplicates. Hmm...
> > what else is there that can contain a collatable type...? Ranges!
> >
> > postgres=# create type myrange as range (subtype = text);
> > CREATE TYPE
> > postgres=# drop table t;
> > DROP TABLE
> > postgres=# create table t (x myrange);
> > CREATE TABLE
> > postgres=# create index on t(x);
> > CREATE INDEX
> > postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> > and refobjversion != '';
> > classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> > refobjversion | deptype
> > ---------+-------+----------+------------+----------+-------------+---------------+---------
> > (0 rows)
> >
> > ... or perhaps, more realistically, a GIST index might actually be
> > useful for range queries, and we're not capturing the dependency:
> >
> > postgres=# create index t_x_idx on t using gist (x);
> > CREATE INDEX
> > postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> > and refobjversion != '';
> > classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> > refobjversion | deptype
> > ---------+-------+----------+------------+----------+-------------+---------------+---------
> > (0 rows)
>
> Good catch :) I fixed it locally and checked that a gist index on a
> range with a subtype being a composite type does record the required
> dependencies.
>
> > The new syntax "ALTER INDEX i_name DEPENDS ON ANY COLLATION UNKNOWN
> > VERSION" doesn't sound good to me, it's not "ANY" collation, it's a
> > specific set of collations that we aren't listing. "ALTER INDEX
> > i_name DEPENDS ON COLLATION * VERSION UNKNOWN", hrmph, no that's
> > terrible... I'm not sure what would be better.
>
> Mmm, indeed. With a 3rd round in the existing keyword, how about
> "DEPENDS ON [ ANY ] REFERENCING COLLATION"? The ANY is mostly to
> avoid the need for plural.
>
> > I'm not sure if I like the idea of VACUUM reporting warnings or not. Hmm.
>
> Even if I add this in a IsAutoVacuumWorkerProcess?
>
> > To state more explicitly what's happening here, we're searching the
> > expression trees for subexpresions that have a collation as part of
> > their static type. We don't know which functions or operators are
> > actually affected by the collation, though. For example, if an
> > expression says "x IS NOT NULL" and x happens to be a subexpression of
> > a type with a particular collation, we don't now that this
> > expression's value can't possibly be affected by the collation version
> > changing. So, the system will nag you to rebuild an index just
> > because you mentioned it, even though the index can't be corrupted.
> > To do better than that, I suppose we'd need declarations in the
> > catalog to say which functions/operators are collation sensitive.
>
> Wouldn't that still be a problem for an absurd expression like
>
> WHERE length((val collate "en_US" > 'uh' collate "en_US")::text) > 0
>
>
> And since we would still have to record a dependency on the collation
> in such case, we would need to have another magic value to distinguish
> "unknown" from "cannot cause corruption" collation version.

PFA rebased v6, with the following changes:

- collation for range types is handled
- duplicated dependencies aren't recorded anymore
- relation column underlying type's collation isn't recorded if it's
used in an index expression and is directly under a CollateExpr node.
This should be safe as the required dependency is already recorded for
the relation's column, and should avoid most of the false positive
warning. With this modification, the following use case:

CREATE TABLE test (id integer, val text collate "en-x-icu"); CREATE
INDEX ON test ( val collate "ga-x-icu" ) WHERE ((val collate
"fr-x-icu") collate "es-x-icu") is not null;

stores the following dependencies for collations:

SELECT refobjid::regcollation, objid::regclass FROM pg_depend WHERE
objid::regclass::text ILIKE 'test%' AND refclassid = 3456;
refobjid | objid
------------+--------------
"en-x-icu" | test
"ga-x-icu" | test_val_idx
"fr-x-icu" | test_val_idx
"es-x-icu" | test_val_idx
(4 rows)

Unless we add a distinct catalog for version dependencies or a
specific magic value, we have to record the "fr-x-icu" and "ex-x-icu"
dependencies.

- for hash indexes, deterministic collations aren't track if it's a
simple key column. I added a new "non_deterministic_only" parameter
to GetTypeCollations() and modified the loop over key attributes in
index_create(). With the previous example modified to be a hash
index, "ga-x-icu" wouldn't be recorded.

Attachment Content-Type Size
0001-Remove-pg_collation.collversion-v6.patch application/octet-stream 23.4 KB
0004-Implement-type-regcollation-v6.patch application/octet-stream 8.9 KB
0005-Preserve-index-dependencies-on-collation-during-pg_u-v6.patch application/octet-stream 27.5 KB
0003-Track-collation-versions-for-indexes-v6.patch application/octet-stream 40.5 KB
0002-Add-pg_depend.refobjversion-v6.patch application/octet-stream 10.3 KB
0006-Add-a-new-ALTER-INDEX-name-DEPENDS-ON-COLLATION-name-v6.patch application/octet-stream 8.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-12-17 13:38:13 Re: non-exclusive backup cleanup is mildly broken
Previous Message Masahiko Sawada 2019-12-17 12:37:10 Re: [HACKERS] Block level parallel vacuum