Re: Why do indexes and sorts use the database collation?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why do indexes and sorts use the database collation?
Date: 2023-11-13 23:02:13
Message-ID: e23815b9-3ed4-2ed0-62b0-e36648d35e5f@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/13/23 23:12, Andres Freund wrote:
> Hi,
>
> On 2023-11-13 22:36:24 +0100, Tomas Vondra wrote:
>> I don't think we can just arbitrarily override the default because we
>> happen to think "C" is going to be faster. If we could prove that using
>> "C" is going to produce exactly the same results as for the implicit
>> collation (for a given operation), then we can simply do that. Not sure
>> if such proof is possible, though.
>
> Yea, I don't know if there's any interesting cases where we could prove that.
>
> I think there *are* interesting cases where we should prove that non-C
> collations are identical. It's imo bonkers that we consider the
> "collname.encname" collations distinct from the equivalent "collname"
> collation.
>

Yeah, I agree that seems a bit ... strange.

> We document that we consider "collname" equivalent to
> "collname.<database encoding>":
>
>> Within any particular database, only collations that use that database's
>> encoding are of interest. Other entries in pg_collation are ignored. Thus, a
>> stripped collation name such as de_DE can be considered unique within a
>> given database even though it would not be unique globally. Use of the
>> stripped collation names is recommended, since it will make one fewer thing
>> you need to change if you decide to change to another database
>> encoding. Note however that the default, C, and POSIX collations can be used
>> regardless of the database encoding.
>
> Followed by:
>
>
>> PostgreSQL considers distinct collation objects to be incompatible even when
>> they have identical properties. Thus for example, [...] Mixing stripped and
>> non-stripped collation names is therefore not recommended.
>
> Why on earth are we solving this by having multiple pg_collation entries for
> exactly the same collation, instead of normalizing the collation-name during
> lookup by adding the relevant encoding name if not explicitly specified? It
> makes a lot of sense to not force the user to specify the encoding when it
> can't differ.
>

True, insisting on having multiple separate entries for the same
collation (and not recognizing which collations are the same) seems
somewhat inconvenient.

>
> It's imo similarly absurd that an index with "default" collation cannot be
> used when specifying the equivalent collation explicitly in the query and vice
> versa.
>

Right. Having to spell

COLLATE "default"

and not the actual collation it references to is weird. Similarly, I
just realized that the collation name in pg_database and pg_collation
are not quite consistent. Consider this:

select datcollate from pg_database where datname = 'test';

datcollate
------------
C.UTF-8
(1 row)

but then

test=# select * from t where c = 'x' collate "C.UTF-8";
ERROR: collation "C.UTF-8" for encoding "UTF8" does not exist
LINE 1: select * from t where c = 'x' collate "C.UTF-8";

because the collation is actually known as C.utf8.

>
>
>
>>>>> Also, wouldn't the intent to use a different collation for the column
>>>>> be
>>>>> expressed by changing the column's collation?
>>>>
>>>> The column collation expresses the semantics of that column. If the
>>>> user has a database collation of "en_US", they should expect ORDER BY
>>>> on that column to be according to that locale unless otherwise
>>>> specified.
>>>
>>> That makes no sense to me. Either the user cares about ordering, in which case
>>> the index needs to be in that ordering for efficient ORDER BY, or they don't,
>>> in which neither index nor column needs a non-C collation. You partially
>>> premised your argument on the content of primary keys typically making non-C
>>> collations undesirable!
>>>
>>
>> I may be missing something, but what's the disagreement here? If the
>> user cares about ordering, they'll specify ORDER BY with either an
>> explicit or the default collation. If the index collation matches, it
>> may be useful for the ordering.
>>
>> Of course, if we feel entitled to create the primary key index with a
>> collation of our choosing, that'd make this unpredictable.
>
> Jeff was saying that textual primary keys typically don't need sorting and
> because of that we could default to "C", for performance. Part of my response
> was that I think the user's intent could be expressed by specifying the column
> collation as "C" - to which Jeff replied that that would change the
> semantics. Which, to me, seems to completely run counter to his argument that
> we could just use "C" for such indexes.
>

True. I think that's somewhat self-contradictory argument.

It's not clear to me if the argument is meant to apply to indexes on all
columns or just those backing primary keys, but I guess it's the latter.
But that (forcing users to specify collation for PK columns, while using
the default for non-PK columns) seems like a recipe for subtle bugs in
applications.

>
>
>>>>> - Teach the planner to use cheaper collations when ordering for
> g> >>> reasons other
>>>>>   than the user's direct request (e.g. DISTINCT/GROUP BY, merge
>>>>> joins).
>>>>
>>>> +1. Where "cheaper" comes from is an interesting question -- is it a
>>>> property of the provider or the specific collation? Or do we just call
>>>> "C" special?
>>>
>>> I'd think the specific collation. Even if we initially perhaps just get the
>>> default cost from the provider such, it structurally seems the sanest place to
>>> locate the cost.
>>>
>>
>> ISTM it's about how complex the rules implemented by the collation are,
>> so I agree the cost should be a feature of collations not providers.
>
> I'm not sure analysing the complexity in detail is worth it. ISTM there's a
> few "levels" of costliness:
>
> 1) memcmp() suffices
> 2) can safely use strxfrm() (i.e. ICU), possibly limited to when we sort
> 3) deterministic collations
> 4) non-deterministic collations
>
> I'm sure there are graduations, particularly within 3), but I'm not sure it's
> realistic / worthwhile to go to that detail. I think a cost model like the
> above would provide enough detail to make better decisions than today...
>

I'm not saying we have to analyze the complexity of the rules. I was
simply agreeing with you that the "cost" should be associated with
individual collations, not the providers.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-11-13 23:19:50 Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)
Previous Message Peter Geoghegan 2023-11-13 22:58:44 Re: Emit fewer vacuum records by reaping removable tuples during pruning