Re: Collation versioning

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2020-02-18 13:53:17
Message-ID: CAOBaU_Z5BPNs4f_aJLdzCzUjCZKHJwRxu5W5CqH_q8jV18+NPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 17, 2020 at 5:58 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Thu, Feb 13, 2020 at 8:13 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > Hearing no complaints on the suggestions, I'm attaching v8 to address that:
> >
> > - pg_dump is now using a binary_upgrade_set_index_coll_version() function
> > rather than plain DDL
> > - the additional DDL is now of the form:
> > ALTER INDEX name ALTER COLLATION name REFRESH VERSION
>
> +1
>
> A couple of thoughts:
>
> @@ -1115,21 +1117,44 @@ index_create(Relation heapRelation,
> ...
> + /*
> + * Get required distinct dependencies on collations
> for all index keys.
> + * Collations of directly referenced column in hash
> indexes can be
> + * skipped is they're deterministic.
> + */
> for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
> {
> - if (OidIsValid(collationObjectId[i]) &&
> - collationObjectId[i] != DEFAULT_COLLATION_OID)
> + Oid colloid = collationObjectId[i];
> +
> + if (OidIsValid(colloid))
> {
> - referenced.classId = CollationRelationId;
> - referenced.objectId = collationObjectId[i];
> - referenced.objectSubId = 0;
> + if ((indexInfo->ii_Am != HASH_AM_OID) ||
> +
> !get_collation_isdeterministic(colloid))
>
> I still don't like the way catalog/index.c has hard-coded knowledge of
> HASH_AM_OID here. Although it errs on the side of the assuming that
> there *is* a version dependency (good)

Oh, but it also means that it fails to create a versionless
dependency, which is totally wrong. What we should do is instead
setup a "track_version" flag to pass down.

It also means that the current way I handled unknown version (empty
string) vs unknown collation lib version (null) will be problematic,
both for runtime check and pg_upgrade. I think we should record an
empty string for both cases, and keep NULL for when explicitly no
version has to be recorded (whether it's not a dependency on
collation, or because the depender doesn't care about version). It
also mean that I'm missing regression tests using such an access
method.

> there is already another AM in
> the tree that could safely skip it for deterministic collations AFAIK:
> Bloom indexes. I suppose that any extension AM that is doing some
> kind of hashing would also like to be able to be able to opt out of
> collation version checking, when that is safe. The question is how to
> model that in our system...

Oh indeed.

> One way would be for each AM to declare whether it is affected by
> collations; the answer could be yes/maybe (default), no,
> only-non-deterministic-ones. But that still feels like the wrong
> level, not taking advantage of knowledge about operators.

On the other hand, would it be possible that some AM only supports
collation-dependency-free operators while still internally relying on
a stable sort order?

> A better way might be to make declarations about that sort of thing in
> the catalog, somewhere in the vicinity of the operator classes, or
> maybe just to have hard coded knowledge about operator classes (ie
> making declarations in the manual about what eg hash functions are
> allowed to consult and when), and then check which of those an index
> depends on. I am not sure what would be best, I'd need to spend some
> time studying the am operator system.

I think this will be required at some point anyway, if we want to
eventually avoid warning about possible corruption when an
expression/where clause isn't depending on the collation ordering.

> Perhaps for the first version of this feature, we should just add a
> new local function
> index_can_skip_collation_version_dependency(indexInfo, colloid) to
> encapsulate your existing logic, but add a comment that in future we
> might be able to support skipping in more cases through analysis of
> the catalogs.

That'd be convenient, but would also break extensibility as bloom
would get a preferential treatment (even if such AM doesn't already
exist).

>
> + <varlistentry>
> + <term><literal>ALTER COLLATION</literal></term>
> + <listitem>
> + <para>
> + This command declares that the index is compatible with the currently
> + installed version of the collations that determine its order. It is used
> + to silence warnings caused by collation version incompatibilities and
> + should be called after rebuilding the index or otherwise verifying its
> + consistency. Be aware that incorrect use of this command can hide index
> + corruption.
> + </para>
> + </listitem>
> + </varlistentry>
>
> This sounds like something that you need to do after you reindex, but
> that's not true, is it? This is something you can do *instead* of
> reindex, to make it shut up about versions. Shouldn't it be something
> like "... should be issued only if the ordering is known not to have
> changed since the index was built"?

Indeed. We should also probably explicitly mention that if the
situation is unknown, REINDEX is the safe alternative to choose.

> +-- Test ALTER INDEX name ALTER COLLATION name REFRESH VERSION
> +UPDATE pg_depend SET refobjversion = 'not a version'
> +WHERE refclassid = 'pg_collation'::regclass
> +AND objid::regclass::text = 'icuidx17_part'
> +AND refobjversion IS NOT NULL;
> +SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
> + objid
> +---------------
> + icuidx17_part
> +(1 row)
> +
> +ALTER INDEX icuidx17_part ALTER COLLATION "en-x-icu" REFRESH VERSION;
> +SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
> + objid
> +-------
> +(0 rows)
> +
>
> Would it be better to put refobjversion = 'not a version' in the
> SELECT list, instead of the WHERE clause, with a WHERE clause that
> hits that one row, so that we can see that the row still exists after
> the REFRESH VERSION (while still hiding the unstable version string)?

Agreed, I'll change that.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mike Blackwell 2020-02-18 14:21:43 Re: Parallel copy
Previous Message Greg Stark 2020-02-18 13:29:39 Re: Index only scan and ctid