Re: Giving the shared catalogues a defined encoding

From:	Nico Williams <nico(at)cryptonector(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>
Subject:	Re: Giving the shared catalogues a defined encoding
Date:	2025-04-17 19:14:37
Message-ID:	aAFTHQR+/v2573XX@ubby
Lists:	pgsql-hackers

On Tue, Dec 10, 2024 at 02:29:09AM +1300, Thomas Munro wrote:
> Here are some things I have learned about pathname encoding:
>
> * Some systems enforce an encoding: macOS always requires UTF-8, ext4
> does too if you turn on case insensitivity, zfs has a utf8only option,
> and a few less interesting-to-us ones have relevant mount options. On
> the first three at least: open("cafe\xe9", ...) -> EILSEQ, independent
> of user space notions like locales.

Watch out: OS X normalizes to NFD on create. I.e., it doesn't preserve
form on disk. ZFS on OS X follows the when-in-Rome principle and does
this too. NFD was a poor choice because all input methods tend to
produce forms closer to NFC, including OS X's input methods. This means
that a `memcmp()`-like string comparison of a user input and an
_equivalent_ filename obtained from a directory listing may not match.
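
To make the mismatch concrete, here is a minimal Python sketch. It assumes a filesystem that hands back NFD and simulates that with `unicodedata.normalize()` rather than touching a real disk; the variable names are only illustrative.

```python
import unicodedata

# What an input method typically produces: e-acute as one code point (NFC).
typed = "caf\u00e9"

# What an NFD-normalizing filesystem would return from a directory listing
# (simulated here; no real filesystem is involved).
on_disk = unicodedata.normalize("NFD", typed)

# A memcmp()-like comparison of the UTF-8 bytes fails...
bytewise_equal = typed.encode("utf-8") == on_disk.encode("utf-8")

# ...while comparing after normalizing both sides to one form succeeds.
normalized_equal = (
    unicodedata.normalize("NFC", typed) == unicodedata.normalize("NFC", on_disk)
)

print(bytewise_equal, normalized_equal)
```

The two strings render identically, but the NFD one is five code points (`e` followed by U+0301) where the NFC one is four.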

Elsewhere ZFS is form-preserving, with form-insensitive matching. This
interops well because input methods tend to produce the same forms for
the same strings. (Normalizing to NFC on create would probably have
been good enough, but at the time I insisted on form-preserving /
form-insensitive, and I still think that was the best option.)

If the ZFS utf8only option is off, it still does the form-preserving /
form-insensitive thing for path components that are valid UTF-8.
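
As a rough illustration of what "form-preserving / form-insensitive" means, here is a toy in-memory directory in Python. It is a sketch of the idea only, not of how ZFS implements it: the class name `FormInsensitiveDir` is hypothetical, names are assumed to be valid Unicode strings, and NFC is picked arbitrarily as the comparison key.

```python
import unicodedata


class FormInsensitiveDir:
    """Toy directory: stores names as given, matches them by normalized form."""

    def __init__(self):
        # normalized key -> name exactly as it was created
        self._entries = {}

    @staticmethod
    def _key(name):
        # Any single normalization form works as a key; NFC is an assumption.
        return unicodedata.normalize("NFC", name)

    def create(self, name):
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name  # form-preserving: keep the caller's form

    def lookup(self, name):
        # form-insensitive: any equivalent form finds the entry
        return self._entries[self._key(name)]

    def listing(self):
        return list(self._entries.values())
```

Creating `"cafe\u0301"` (NFD) and then looking up `"caf\u00e9"` (NFC) returns the NFD name as it was created, and a second create with the NFC spelling is rejected as a duplicate.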

> [...]
> All of that is perfectly reasonable I think, I just want to highlight
> the cascading effect of the new constraint: Apple's file system
> restricts your *database* encoding, with this design, unless you stick
> to plain ASCII pathnames. It is an interesting case to compare with
> when untangling the Windows mess, see below...
>
> * Traditional Unix filesystems eg ext4/xfs/ufs/... just don't care:
> beyond '/' being special, the encoding is in the eye of the beholder.

Correct, though there's another ASCII codepoint that all Unix
filesystems always treat specially: NUL :) (And, as you point out,
newlines in filenames can be a problem and are discouraged, but are
generally neither forbidden nor treated specially by the filesystem
system calls or the filesystems themselves.)
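
A small Python sketch of that rule, for illustration. The helper names are hypothetical, and it deliberately ignores length limits and the reserved `.` and `..` entries; it only models "any bytes except `/` and NUL".

```python
def is_legal_unix_component(name: bytes) -> bool:
    """True if `name` could be one path component on a traditional Unix
    filesystem: non-empty, and free of '/' and NUL. Nothing else is checked."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


def is_valid_utf8(name: bytes) -> bool:
    """True if `name` decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "café" encoded as Latin-1: a perfectly legal filename, but not UTF-8.
latin1_name = "caf\u00e9".encode("latin-1")

print(is_legal_unix_component(latin1_name), is_valid_utf8(latin1_name))
print(is_legal_unix_component(b"a\nb"))      # newline is allowed
print(is_legal_unix_component(b"a/b"))       # '/' is not
print(is_legal_unix_component(b"a\x00b"))    # NUL is not
```

The same bytes read as "café" to a Latin-1 beholder and as garbage to a UTF-8 one, and the filesystem has no opinion either way.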

I call this "just-use-8" behavior.

No Unix C library bothers to implement a UTF-8 convention for paths by
doing codeset conversions when running in non-UTF-8 locales. So the
only reasonable way to do I18N interop on Unix is to stick *strictly* to
UTF-8 locales.
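
One way a program could enforce that policy is to check the locale's codeset at startup. The following Python sketch is only an illustration of such a check; the function names are hypothetical, and `locale.nl_langinfo()` is available on Unix only.

```python
import codecs
import locale


def codeset_is_utf8(codeset: str) -> bool:
    """True if `codeset` is some spelling of UTF-8 ("UTF-8", "utf8", ...)."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        # Unknown codeset names are treated as "not UTF-8".
        return False


def current_locale_is_utf8() -> bool:
    """Check the codeset of the process's current LC_CTYPE locale (Unix only)."""
    return codeset_is_utf8(locale.nl_langinfo(locale.CODESET))
```

A program following the "UTF-8 locales only" rule would refuse to proceed when `current_locale_is_utf8()` returns false, instead of attempting any conversion.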

> * Windows has a completely different model. Pathnames are really
> UTF-16 in the kernel and on disk. All char strings exchanged with the
> system have a defined encoding, but it was non-obvious to this humble
> Unix hacker what it is in each case. I don't have Windows, so I spent
> the afternoon firing test code at CI[1][2] to figure some of it out.

Historically, on Windows NT and up, the filesystem system calls and the
filesystems themselves are "just-use-16", with the convention that the
applications and the C runtime will be using UTF-16.
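
To spell out the difference between "just-use-16" and actual UTF-16: a name is a sequence of 16-bit units, and nothing at that layer requires surrogates to be properly paired. The Python sketch below checks well-formedness of such a sequence; the function name is hypothetical and the code models only the encoding rule, not any Windows API.

```python
def is_well_formed_utf16(units) -> bool:
    """True if a sequence of 16-bit code units is well-formed UTF-16,
    i.e. every high surrogate is immediately followed by a low surrogate
    and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: needs a low one next
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:  # stray low surrogate
            return False
        i += 1
    return True


print(is_well_formed_utf16([0x0063, 0x0061, 0x0066, 0x00E9]))  # "café"
print(is_well_formed_utf16([0xD83D, 0xDE00]))                  # valid pair
print(is_well_formed_utf16([0xD800]))                          # lone surrogate
```

Under a "just-use-16" model all three sequences are acceptable names; only the first two are UTF-16 by convention, which is the analogue of arbitrary bytes versus valid UTF-8 on Unix.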

Since nowadays Unix systems strongly prefer UTF-8 locales, the Windows
convention is not that different from the Unix one in practice.

> * What I'm wondering is whether we can instead achieve coherence along
> the lines of the Apple UTF-8 case I described above, but with an extra
> step: if you want to use non-ASCII paths *you have to make your ACP
> match the database and cluster encoding*. So either you go all-in on
> your favourite 80s encoding like WIN1252 that matches your ACP
> (impossible for 932 AKA SJIS), or you switch your system's ACP to
> UTF-8. Alternatively, I believe postgres.exe could even be built in a
> way that makes its ACP always UTF-8[3] (I guess the loader has to help
> with that as it affects the way it sets up environ[] and argv[] before
> main() runs). I don't know all the consequences though. And I don't
> know what exact rules would be best, but something like that would be
> in keeping with the general philosophy of this project: just figure
> out how to block the combinations that don't work correctly.

Can you insist that for new DBs `initdb`/`createdb`/`postgres` run only
in Unicode locales on Windows, and with the UTF-8 codepage? Do you have
to support older Windows releases?

> (HBA content is also an interesting topic.)

I bet.

Nico
--

In response to

Re: Giving the shared catalogues a defined encoding at 2024-12-09 13:29:09 from Thomas Munro