Re: Collation versioning

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Douglas Doole <dougdoole(at)gmail(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collation versioning
Date: 2020-02-26 14:29:12
Message-ID: CAOBaU_bZKvPF8B53jkrFpaS1y4GYbX7RJh_ZXjpuSOC6ODAcJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 18, 2020 at 2:53 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Mon, Feb 17, 2020 at 5:58 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > On Thu, Feb 13, 2020 at 8:13 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > > Hearing no complaints on the suggestions, I'm attaching v8 to address that:
> > >
> > > - pg_dump is now using a binary_upgrade_set_index_coll_version() function
> > > rather than plain DDL
> > > - the additional DDL is now of the form:
> > > ALTER INDEX name ALTER COLLATION name REFRESH VERSION
> >
> > +1
> >
> > A couple of thoughts:
> >
> > @@ -1115,21 +1117,44 @@ index_create(Relation heapRelation,
> > ...
> > + /*
> > + * Get required distinct dependencies on collations
> > for all index keys.
> > + * Collations of directly referenced column in hash
> > indexes can be
> > + * skipped is they're deterministic.
> > + */
> > for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
> > {
> > - if (OidIsValid(collationObjectId[i]) &&
> > - collationObjectId[i] != DEFAULT_COLLATION_OID)
> > + Oid colloid = collationObjectId[i];
> > +
> > + if (OidIsValid(colloid))
> > {
> > - referenced.classId = CollationRelationId;
> > - referenced.objectId = collationObjectId[i];
> > - referenced.objectSubId = 0;
> > + if ((indexInfo->ii_Am != HASH_AM_OID) ||
> > +
> > !get_collation_isdeterministic(colloid))
> >
> > I still don't like the way catalog/index.c has hard-coded knowledge of
> > HASH_AM_OID here. Although it errs on the side of the assuming that
> > there *is* a version dependency (good)
>
> Oh, but it also means that it fails to create a versionless
> dependency, which is totally wrong. What we should do is instead
> setup a "track_version" flag to pass down.
>
> It also means that the current way I handled unknown version (empty
> string) vs unknown collation lib version (null) will be problematic,
> both for runtime check and pg_upgrade. I think we should record an
> empty string for both cases, and keep NULL for when explicitly no
> version has to be recorded (whether it's not a dependency on
> collation, or because the depender doesn't care about version).

Fixed this way.

> It
> also mean that I'm missing regression tests using such an access
> method.

Done

> > there is already another AM in
> > the tree that could safely skip it for deterministic collations AFAIK:
> > Bloom indexes. I suppose that any extension AM that is doing some
> > kind of hashing would also like to be able to be able to opt out of
> > collation version checking, when that is safe. The question is how to
> > model that in our system...
>
> Oh indeed.
>
> > One way would be for each AM to declare whether it is affected by
> > collations; the answer could be yes/maybe (default), no,
> > only-non-deterministic-ones. But that still feels like the wrong
> > level, not taking advantage of knowledge about operators.
>
> On the other hand, would it be possible that some AM only supports
> collation-dependency-free operators while still internally relying on
> a stable sort order?
>
> > A better way might be to make declarations about that sort of thing in
> > the catalog, somewhere in the vicinity of the operator classes, or
> > maybe just to have hard coded knowledge about operator classes (ie
> > making declarations in the manual about what eg hash functions are
> > allowed to consult and when), and then check which of those an index
> > depends on. I am not sure what would be best, I'd need to spend some
> > time studying the am operator system.
>
> I think this will be required at some point anyway, if we want to
> eventually avoid warning about possible corruption when an
> expression/where clause isn't depending on the collation ordering.
>
> > Perhaps for the first version of this feature, we should just add a
> > new local function
> > index_can_skip_collation_version_dependency(indexInfo, colloid) to
> > encapsulate your existing logic, but add a comment that in future we
> > might be able to support skipping in more cases through analysis of
> > the catalogs.
>
> That'd be convenient, but would also break extensibility as bloom
> would get a preferential treatment (even if such AM doesn't already
> exist).

I added an index_depends_stable_coll_order() static function for now.

> > + <varlistentry>
> > + <term><literal>ALTER COLLATION</literal></term>
> > + <listitem>
> > + <para>
> > + This command declares that the index is compatible with the currently
> > + installed version of the collations that determine its order. It is used
> > + to silence warnings caused by collation version incompatibilities and
> > + should be called after rebuilding the index or otherwise verifying its
> > + consistency. Be aware that incorrect use of this command can hide index
> > + corruption.
> > + </para>
> > + </listitem>
> > + </varlistentry>
> >
> > This sounds like something that you need to do after you reindex, but
> > that's not true, is it? This is something you can do *instead* of
> > reindex, to make it shut up about versions. Shouldn't it be something
> > like "... should be issued only if the ordering is known not to have
> > changed since the index was built"?
>
> Indeed. We should also probably explicitly mention that if the
> situation is unknown, REINDEX is the safe alternative to choose.

Done.

> > +-- Test ALTER INDEX name ALTER COLLATION name REFRESH VERSION
> > +UPDATE pg_depend SET refobjversion = 'not a version'
> > +WHERE refclassid = 'pg_collation'::regclass
> > +AND objid::regclass::text = 'icuidx17_part'
> > +AND refobjversion IS NOT NULL;
> > +SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
> > + objid
> > +---------------
> > + icuidx17_part
> > +(1 row)
> > +
> > +ALTER INDEX icuidx17_part ALTER COLLATION "en-x-icu" REFRESH VERSION;
> > +SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
> > + objid
> > +-------
> > +(0 rows)
> > +
> >
> > Would it be better to put refobjversion = 'not a version' in the
> > SELECT list, instead of the WHERE clause, with a WHERE clause that
> > hits that one row, so that we can see that the row still exists after
> > the REFRESH VERSION (while still hiding the unstable version string)?
>
> Agreed, I'll change that.

And done.

An induced change in the attached v10 is that
pg_collation_actual_version() SQL function now also returns an empty
string if the collation version is unknown. That's not strictly
required, but it makes things more consistent and help writing queries
that finds which indexes may be broken. That function's behavior
wasn't documented for unknown version, so I also added some
clarification.

Attachment Content-Type Size
0003-Implement-type-regcollation-v10.patch application/octet-stream 8.9 KB
0001-Remove-pg_collation.collversion-v10.patch application/octet-stream 23.4 KB
0002-Add-pg_depend.refobjversion-v10.patch application/octet-stream 11.6 KB
0004-Track-collation-versions-for-indexes-v10.patch application/octet-stream 61.0 KB
0005-Preserve-index-dependencies-on-collation-during-pg_u-v10.patch application/octet-stream 38.9 KB
0006-Add-a-new-ALTER-INDEX-name-ALTER-COLLATION-name-REFR-v10.patch application/octet-stream 14.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2020-02-26 14:29:18 Re: Commit fest manager for 2020-03
Previous Message Fujii Masao 2020-02-26 14:18:15 Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side