Re: Fix number skipping in to_number

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oliver Ford <ojford(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix number skipping in to_number
Date: 2017-11-15 21:28:48
Message-ID: 5597.1510781328@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Oliver Ford <ojford(at)gmail(dot)com> writes:
>> On Monday, 13 November 2017, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I don't follow your concern? If "$" is not the correct currency
>>> symbol for the locale, we shouldn't accept it as a match to an L format.
>>> Your patch is tightening what we will accept as a match to a G format,
>>> so I don't see why you're concerned about backward compatibility in
>>> one case but not the other.

>> It's a guess as to the likely use case. I would imagine that people are
>> likely to use a currency symbol different from the locale, but unlikely to
>> use a different group separator. Others might have a different opinion
>> though.

> Well, if they use a currency symbol different from the locale's, they're
> in trouble anyway because the number of bytes might be different. In most
> encodings, symbols other than "$" are probably not 1-byte characters.
> At the very least I think we need to constrain it enough that it not
> swallow a fractional character.

After more testing I understood your concern about L_currency_symbol:
in C locale that's " ", not "$" as I naively imagined. Americans,
at least, would be pretty unhappy if "$1234.56" suddenly stopped matching
"L9999.99". So it seems like we can't institute a strict matching rule.
However, it is certainly not good that this happens:

regression=# select to_number('1234.56', 'L9999.99');
to_number
-----------
234.56
(1 row)

To me that seems just as bad as having ',' or 'G' eat a digit.

After some reflection I propose that the rule that we want is:

* ',' and 'G' consume input only if it exactly matches the expected
separator.

* Other non-data template patterns consume a number of input characters
equal to the number of characters they'd produce in output, *except* that
these patterns will not consume data characters (digits, signs, decimal
point, comma).

I think that while we are at it we should take some measures to ensure
that "character" in this definition means "character", not "byte".
It is not good that a euro currency symbol might consume an
encoding-dependent number of input characters.

That leads me to the attached patch. There is more that could be done
here --- in particular, I'd like to see the character-not-byte-count
rule extended to literal text. But that seems like fit material for
a different patch.

Also, I noticed that in your form of the patch, the strncmp() could read
past the end of the string, possibly resulting in a crash. So I made it
use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that.

One other note: I realized that it was only pure luck that your regression
test cases worked in locales where 'G' is decimal point --- they still
gave the same answer, but through a totally different interpretation of
the input. That did not seem like a good idea, so I adjusted the
regression test to force C locale for the to_number() tests. I wish we
could use some other locale here, but then it likely wouldn't be portable
to Windows.

regards, tom lane

Attachment Content-Type Size
apply-number-v5.patch text/x-diff 17.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-11-15 21:30:28 Re: [HACKERS] Inconsistencies between pg_settings and postgresql.conf
Previous Message Peter Eisentraut 2017-11-15 21:21:56 Re: Transaction control in procedures