Re: ICU integration

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ICU integration
Date: 2017-02-16 05:17:53
Message-ID: c288b52e-2b83-8c0c-0f99-b34531763a5a@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/31/17 16:59, Thomas Munro wrote:
> I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
> the start of the string via the UCharIterator passed in, unless you
> have the rare reverse-accent-sort feature enabled (maybe used only in
> fr_CA, it looks like it is required to scan the whole string looking
> for the last accent). So I assume that uiter_setUTF8 and
> ucol_nextSortKeyPart would usually do a small fixed amount of work,
> whereas this patch's icu_to_uchar allocates space and converts the
> whole variable length string every time.

I have changed it to use ucol_nextSortKeyPart() when appropriate.

> That's about abbreviation, but I note that you can also compare
> strings using iterators with ucol_strcollIter, avoiding the need to
> allocate and transcode up front. I have no idea whether that'd pay
> off.

These iterators don't actually transcode on the fly. You can only set
up iterators for UTF-8 or UTF-16 strings. And for UTF-8, we already
have a special code path using ucol_strcollUTF8(), which uses an
interator internally.

> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
> and the ICU major version changes we expect to have to REINDEX, but
> does anyone know if such data changes are ever pulled into the minor
> version package upgrades you get from regular apt-get update of (say)
> a Debian 8 or CentOS 7 or FreeBSD 11 system? In other words, do we
> expect collversion changes to happen basically any time in response to
> regular system updates, or only when you're doing a major upgrade of
> some kind, as the above-quoted documentation patch implies?

Stable operating systems shouldn't do major library upgrades, so I feel
pretty confident about this.

>
> + collversion = ntohl(*((uint32 *) versioninfo));
>
> UVersionInfo is an array of four uint8_t. That doesn't sound like
> something that needs to be bit-swizzled... is it really? Even if it
> is arranged differently on different architectures, I'm not sure why
> you care since we only ever use it to compare for equality on the same
> system. But aside from that, I don't love this cast to uint32. I
> wonder if we should use u_versionToString to respect boundaries a bit
> better?

Makes sense, changed the column type to text.

> I have another motivation for wanting to model collation versions as
> strings: I have been contemplating a version check for system-provided
> collations too, piggy-backing on your work here. Obviously PostgreSQL
> can't directly know anything about system collation versions, but I'm
> thinking of a GUC system_collation_version_command which you could
> optionally set to a script that knows how to inspect the local
> operating system. For example, a package maintainer might be
> interested in writing such a script that knows how to ask the package
> system for a locale data version. A brute force approach that works
> on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.

That sounds like an idea worth pursuing.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Seki, Eiji 2017-02-16 05:24:25 Re: Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags
Previous Message Peter Eisentraut 2017-02-16 05:10:33 Re: ICU integration