Quick Links

Re: Allow to_date() and to_timestamp() to accept localized names

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
Cc:	Arthur Zakirov <zaartur(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Allow to_date() and to_timestamp() to accept localized names
Date:	2020-01-22 22:11:27
Message-ID:	29432.1579731087@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics. Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does? (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way. But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus. In the first place, it scribbles on its source string,
which is surely not okay. You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR: invalid value "JanuzRY 1" for "MONTH"
DETAIL: The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit. It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching. pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead. So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well. It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior. So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages. There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints. I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether. Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths. That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched. In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it. It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

regards, tom lane

Attachment	Content-Type	Size
0001-formatting-fix-seq-search.patch	text/x-diff	8.8 KB
0002-formatting-constify.patch	text/x-diff	3.9 KB

In response to

Re: Allow to_date() and to_timestamp() to accept localized names at 2020-01-21 23:20:24 from Tom Lane

Responses

Re: Allow to_date() and to_timestamp() to accept localized names at 2020-01-23 03:17:22 from Arthur Zakirov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Daniel Gustafsson	2020-01-22 22:18:12	Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
Previous Message	Robert Haas	2020-01-22 22:07:58	Re: Online checksums patch - once again