Re: Change initdb default to the builtin collation provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Change initdb default to the builtin collation provider
Date: 2026-03-12 19:20:07
Message-ID: e364efcfa40af47aca9071ea81b38ce5573556fe.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2026-03-12 at 10:04 -0400, Robert Haas wrote:
> Yes. I think actually one of the big challenges right now is making
> sure that when you initdb to do a pg_upgrade, you get the right
> settings to make the upgrade work.

pg_upgrade should copy the locale settings to the new cluster as of
9637badd9f. If there are still some rough edges here, let me know.

> I don't have total information, but I think they mostly use a single
> locale. If they have extremely specific needs, they are likely to end
> up with ICU, else they pick a glibc locale. I have no idea how likely
> that glibc locale is to match their environment. I wouldn't bet on it
> being the norm, but I wouldn't bet against whatever they have in the
> environment being more usable than "C".

That's interesting. In other words, (in your sample) users aren't
worried about the precise sort order in their native language; it's
just that ASCII is particularly bad, and almost any "real" locale is
more appealing.

If the concern is mostly that ASCII is particularly bad, how much of
that is because case is a high-order bit (i.e. 'A' < 'Z' < 'a' < 'z')?
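
To make the "case is a high-order bit" point concrete, here is a small Python sketch. Python compares strings by code point, which is used here as a stand-in for "C" ordering on ASCII data. The sample words and the case-folded comparison key are assumptions chosen for illustration, not what any PostgreSQL collation provider does.

```python
# Sample data (hypothetical) mixing upper- and lower-case initial letters.
words = ["apple", "Zebra", "zoo", "Apple"]

# Code point order: every upper-case ASCII letter sorts before every
# lower-case one, so "Zebra" lands ahead of "apple".
codepoint_order = sorted(words)

# Illustrative alternative: compare case-insensitively first, then use the
# original string as a tie-breaker. This is only a sketch of an ordering
# where case is a low-order distinction.
case_low_order = sorted(words, key=lambda w: (w.casefold(), w))

print(codepoint_order)  # ['Apple', 'Zebra', 'apple', 'zoo']
print(case_low_order)   # ['Apple', 'apple', 'Zebra', 'zoo']
```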

> It's tough if people have range scans.

Range scans using a natural language collation are dubious. It can't be
for a prefix search; LIKE 'myprefix%' needs the index to be defined
with text_pattern_ops (which is code point order), so the default isn't
going to work for them anyway.

(A prefix search can't be implemented with a range scan in natural
language collation because, e.g. in the cs_CZ locale, 'cha' does not
fall between 'ch' and 'ci'.)
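
The 'ch' example can be sketched in Python. The key function below is a deliberately simplified toy, not the real cs_CZ collation: it only assumes that the digraph "ch" is a single letter sorting after "h", and it handles lower-case ASCII input only. The names `toy_czech_key` and `in_range` are hypothetical helpers for this illustration.

```python
# Toy alphabet: lower-case ASCII letters with the digraph "ch" placed
# after "h". This is an assumption for illustration, not real cs_CZ rules.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def toy_czech_key(s):
    """Split s into letters, treating "ch" as one letter; return their ranks."""
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[s[i]])
            i += 1
    return key

def in_range(value, low, high, key):
    """Emulate the range condition low <= value < high under a sort key."""
    return key(low) <= key(value) < key(high)

# A prefix search for 'ch%' rewritten as the range ['ch', 'ci').
codepoint_hit = in_range("cha", "ch", "ci", key=lambda s: s)
toy_hit = in_range("cha", "ch", "ci", key=toy_czech_key)

print(codepoint_hit)  # True: code point order keeps 'cha' inside the range
print(toy_hit)        # False: with "ch" after "h", 'ci' < 'ch' < 'cha'
```

Under the toy key the range ['ch', 'ci') is empty, because 'ci' sorts before 'ch', so a row starting with 'cha' would be missed by the range scan even though it matches the prefix.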

So how often is a range scan using a natural language collation
actually useful? I'm sure there are some real cases, but I'd say it's
usually a mistake and they are quite possibly getting wrong results.

> Not everybody does, but they also don't know whether or not they will
> want them when they're making setup choices. Picking a locale that
> matches their desired sort order *in case* they end up using range
> scans in some queries feels like the "safe" choice.

I have trouble understanding this perspective: slow all indexes down
(and sorts, too), and risk index inconsistencies just in case someone
ends up doing a range scan on one of the indexes? How is that safer?

> What I'm most worried about is the population of users -- which I
> guess to be large -- who do not have a strong preference but won't be
> happy with something as dumb as "C". If even a small fraction of users
> create a database using "C" unintentionally and load a terabyte of
> data into it before realizing that all their text indexes are sorting
> "wrong", I suspect that's not going to be much fun.

This is where we differ: even in that case, I believe all (or nearly
all) of that user's indexes would be better.

When you look at the conditions that must be true for an index with a
natural language collation to be useful, it's certainly not the normal
case, and I'd bet it's closer to "rare":

* the use case must be real (not relying on faulty assumptions about
  lexicographical ordering)
* the input data must be large enough to benefit from an index scan
* one of the following must be true:
  - the index needs to be correlated with the heap order (seems
    unlikely; correlation usually happens with sequences, timestamps,
    etc., not natural language text values); or
  - needs to be eligible for an index only scan (plausible); or
  - the amount of data read must be small enough that correlation
    with the heap doesn't matter
* the result data needs to be small enough for a human to consume it
  (otherwise why bother with natural language?)
* the performance improvement must be enough to offset the penalty
  for equality searches and index maintenance

While each of those is plausible, when combined, I think it's far from
the typical case.

It's perfectly reasonable to say the user may be upset about the way
the final result order looks, but making all the index orderings worse
is not a good way to fix that.

Regards,
Jeff Davis
