Re: Fix number skipping in to_number

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oliver Ford <ojford(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix number skipping in to_number
Date: 2017-11-17 22:28:23
Message-ID: 28186.1510957703@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> That leads me to the attached patch. There is more that could be done
> here --- in particular, I'd like to see the character-not-byte-count
> rule extended to literal text. But that seems like fit material for
> a different patch.

Attached is a patch that makes formatting.c more multibyte-aware;
it now handles multibyte characters as single NODE_TYPE_CHAR format
nodes, rather than one node per byte. This doesn't really have much
impact on the output (to_char) side, but on the input side, it
greatly simplifies treating such characters as single characters
rather than multiple ones. An example is that (in UTF8 encoding)
previously we got

u8=# select to_number('$12.34', '€99D99');
 to_number 
-----------
      0.34
(1 row)

because the literal euro sign is 3 bytes long and was treated as
3 literal characters, so three input characters ("$12") were skipped
instead of one. Now we get the expected result

u8=# select to_number('$12.34', '€99D99');
 to_number 
-----------
     12.34
(1 row)
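
To make the byte-versus-character distinction concrete, here is a rough
standalone sketch. It is not code from the patch: utf8_char_len() and
skip_chars() are hypothetical helpers that assume UTF-8 input, whereas
the backend would use pg_mblen(), which covers other encodings too.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Byte length of the UTF-8 character starting at s, judged from the
 * lead byte alone.  Hypothetical helper, for illustration only.
 */
static size_t
utf8_char_len(const unsigned char *s)
{
	if (*s < 0x80)
		return 1;				/* plain ASCII */
	if ((*s & 0xE0) == 0xC0)
		return 2;
	if ((*s & 0xF0) == 0xE0)
		return 3;				/* e.g. the euro sign, E2 82 AC */
	if ((*s & 0xF8) == 0xF0)
		return 4;
	return 1;					/* invalid lead byte: advance one byte */
}

/*
 * Skip up to nchars whole characters (not bytes) of a NUL-terminated
 * UTF-8 string and return a pointer to what remains.
 */
static const char *
skip_chars(const char *str, size_t nchars)
{
	assert(str != NULL);
	while (nchars-- > 0 && *str != '\0')
	{
		size_t		len = utf8_char_len((const unsigned char *) str);

		/* never step past the terminator, even on a truncated sequence */
		while (len-- > 0 && *str != '\0')
			str++;
	}
	return str;
}
```

With these helpers, skip_chars("$12.34", 3) leaves ".34", which mirrors
the old behavior where the euro sign counted as three literal
characters, while skip_chars("$12.34", 1) leaves "12.34", matching one
skipped input character per literal format character.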

Aside from skipping 1 input character (not byte) per literal format
character, I fixed the SKIP_THth macro, allowing to_date/to_timestamp to
also follow the rule of skipping whole characters not bytes for non-data
format patterns. There might be some other places that need similar
adjustments, but I couldn't find any.

Not sure about whether/how to add regression tests for this; it's really
impossible to add specific tests in an ASCII-only test file. Maybe we
could put a test or two into collate.linux.utf8.sql, but it seems a bit
off topic for that, and I think that test file hardly gets run anyway.

Note this needs to be applied over the patch I posted at
https://postgr.es/m/3626.1510949486@sss.pgh.pa.us
I intend to commit that fairly soon, but it's not in right now.

regards, tom lane

Attachment Content-Type Size
fix-multibyte-literal-chars-in-formatting.c.patch text/x-diff 7.5 KB
