Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific

From: Evan Jones <evan(dot)jones(at)datadoghq(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Date: 2023-10-10 14:51:10
Message-ID: CA+HWA9aN-M1O-9Ma=_Pqz-uwzDA07DEk+pui7Zy7K7-Y1PjpUg@mail.gmail.com
Lists: pgsql-hackers

Thanks for bringing this up! I just looked at the uses of isspace() in that
file. It looks like it is the usual thing: it is allowing leading or
trailing whitespace when parsing values, or for this "needs quoting" logic
on output. The fix would be the same: this *should* be
using scanner_isspace. This has the same disadvantage: it would change
Postgres's results for some inputs that contain these non-ASCII "space"
characters.

Here is a quick demonstration of this issue, showing that the quoting
behavior differs between the two locales. On Mac OS X with the "default"
locale, the output includes quotes because the UTF-8 encoding of ą
(0xC4 0x85) contains the byte 0x85, which isspace() treats as a space there:

postgres=# SELECT ROW('keyą');
   row
----------
 ("keyą")
(1 row)

On Mac OS X with the LANG=C environment variable set, it does not include
quotes:

postgres=# SELECT ROW('keyą');
  row
--------
 (keyą)
(1 row)
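
The difference between the two checks can be sketched in a few lines of C.
This is only an illustration, not the code in rowtypes.c: the helper names
ascii_only_isspace, needs_quoting_ascii and needs_quoting_locale are
hypothetical, and the exact set of characters accepted is an assumption
modelled on the idea behind scanner_isspace (a fixed ASCII list that ignores
the locale).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true only for ASCII whitespace, regardless of the
 * current locale. The character list is an assumption for this sketch.
 */
static bool
ascii_only_isspace(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\r' || ch == '\f' || ch == '\v';
}

/* Hypothetical helper: does any byte of s count as ASCII whitespace? */
static bool
needs_quoting_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (ascii_only_isspace(*s))
            return true;
    }
    return false;
}

/*
 * Hypothetical helper: the same loop, but using the locale-dependent
 * isspace() on each byte. For "key\xc4\x85" (keyą in UTF-8) the result
 * depends on whether the platform's locale classifies 0x85 as a space.
 */
static bool
needs_quoting_locale(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (isspace((unsigned char) *s))
            return true;
    }
    return false;
}
```

With these helpers, needs_quoting_ascii("key\xc4\x85") is false on every
platform, while the result of needs_quoting_locale() for the same string can
vary with the platform and locale, which is the behavior shown in the two
psql sessions above.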

On Mon, Oct 9, 2023 at 11:18 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> FTR I ran into a benign case of the phenomenon in this thread when
> dealing with row types. In rowtypes.c, we double-quote stuff
> containing spaces, but we detect them by passing individual bytes of
> UTF-8 sequences to isspace(). Like macOS, Windows thinks that 0xa0 is
> a space when you do that, so for example the Korean character '점'
> (code point C810, UTF-8 sequence EC A0 90) gets quotes on Windows but
> not on Linux. That confused a migration/diff tool while comparing
> Windows and Linux database servers using that representation. Not a
> big deal, I guess no one ever promised that the format was stable
> across platforms, and I don't immediately see a way for anything more
> serious to go wrong (though I may lack imagination). It does seem a
> bit weird to be using locale-aware tokenising for a machine-readable
> format, and then making sure its behaviour is undefined by feeding it
> chopped up bytes.
>
