Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

From: Imre Samu <pella(dot)samu(at)gmail(dot)com>
To: Debraj Manna <subharaj(dot)manna(at)gmail(dot)com>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x
Date: 2018-09-15 11:21:16
Message-ID: CAJnEWwmQeTB2NzmO2_hU8Q3zkvWremQF91Wu_xZ-Gvwh6R599w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

> Is there any link that I can refer that provides more details about the
differences?

According to the documentation (
https://www.postgresql.org/docs/10/static/locale.html )

*""The locale settings influence the following SQL features:*

-
*Sort order in queries using ORDER BY or the standard comparison operators
on textual data *
-
*The upper, lower, and initcap functions *
-
*Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular
expressions); locales affect both case insensitive matching and the
classification of characters by character-class regular expressions *
-
*The to_char family of functions *
- *The ability to use indexes with LIKE clauses*

*The drawback of using locales other than C or POSIX in PostgreSQL is its
performance impact. It slows character handling and prevents ordinary
indexes from being used by LIKE. For this reason use locales only if you
actually need them.*

*As a workaround to allow PostgreSQL to use indexes with LIKE clauses under
a non-C locale, several custom operator classes exist. These allow the
creation of an index that performs a strict character-by-character
comparison, ignoring locale comparison rules. Refer to Section 11.9 for
more information. Another approach is to create indexes using the C
collation, as discussed in Section 23.2.*""

the performance impact sometimes is huge.
Sorting Geohash with "C" locale is sometimes 40% faster: " ORDER BY
ST_GeoHash(ST_Transform(ST_Envelope(geom),4326),10) COLLATE "C";"
http://www.paulnorman.ca/blog/2016/05/improve-your-st-geohash-sorting-with-these-three-simple-tricks/

Some links:
* https://www.postgresql.org/docs/current/static/collation.html
* https://blog.2ndquadrant.com/icu-support-postgresql-10/ ( More robust
collations with ICU support in PostgreSQL 10 )
*
https://blog.anayrat.info/en/2017/11/19/postgresql-10--icu--abbreviated-keys/
(PostgreSQL 10 : ICU & Abbreviated Keys )

Best,
Imre

2018-09-15 9:02 GMT+02:00 Debraj Manna <subharaj(dot)manna(at)gmail(dot)com>:

> Thanks for replying.
>
> Will there be any other difference like in terms of index size, etc?
>
> Is there any link that I can refer that provides more details about the
> differences?
>
> On Fri 14 Sep, 2018, 5:27 PM Laurenz Albe, <laurenz(dot)albe(at)cybertec(dot)at>
> wrote:
>
>> Debraj Manna wrote:
>> > Can someone let me know what is the difference we are expected to
>> observe if we set
>> > Collate & Ctype to C as against en_US.UTF-8 with encoding set to UTF8
>> in both the cases ?
>>
>> For one, the ordering will be substantially different.
>>
>> Compare the result of these two queries:
>>
>> SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE
>> "C";
>> SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE
>> "en_US.utf8";
>>
>> Yours,
>> Laurenz Albe
>>
>>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Mariel Cherkassky 2018-09-16 09:02:08 Failing to compile sqlite_fdw
Previous Message Ron 2018-09-15 08:21:18 Re: how to debug the postgres performance issue