Re: Allow to_date() and to_timestamp() to accept localized names

From: Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Arthur Zakirov <zaartur(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Allow to_date() and to_timestamp() to accept localized names
Date: 2020-01-24 09:48:19
Message-ID: CAC+AXB1fQtg-iWH=4qZnV0xYm08WR6W7AhoApKA3tkFsWgf90A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

Thank you for your time looking into this.

Here's a v7 patch, rebased over my recent hacking, and addressing
> most of the complaints I raised in <31691(dot)1579648824(at)sss(dot)pgh(dot)pa(dot)us>.
> However, I've got some new complaints now that I've looked harder
> at the code:
>
> * It's not exactly apparent to me why this code should be concerned
> about non-normalized characters when noplace else in the backend is.
> I think we should either rip that out entirely, or move the logic
> into str_tolower (and hence also str_toupper etc). I'd tend to favor
> the former, but of course I don't speak any languages where this would
> be an issue. Perhaps a better idea is to invent a new SQL function
> that normalizes a given string, and expect users to call that first
> if they'd like to_date() to match unnormalized text.
>
>
There is an open patch that will make the normalization functionality user
visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
TMMON YYYY') I would vote to drop the normalization logic inside this patch
altogether.

* I have no faith in this calculation that decides how long the match
> length was:
>
> *len = element_len + name_len - norm_len;
>
> I seriously doubt that this does the right thing if either the
> normalization or the downcasing changed the byte length of the
> string. I'm not actually sure how we can do that correctly.
> There's no good way to know whether the changes happened within
> the part of the "name" string that we matched, or the part beyond
> what we matched, but we only want to count the former. So this
> seems like a pretty hard problem, and even if this logic is somehow
> correct as it stands, it needs a comment explaining why.
>
>
The proper logic would come from do_to_timestamp() receiving a normalized
"date_txt" input, so we do not operate with unnormalize and normalize
strings at the same time.

> * I'm still concerned about the risk of integer overflow in the
> string allocations in seq_search_localized(). Those need to be
> adjusted to be more paranoid, similar to the code in e.g. str_tolower.
>

Please find attached a patch with the normalization logic removed, thus no
direct allocations in seq_search_localized().

I would like to rise a couple of questions myself:

* When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’
defined but not used". Should we drop this function or uncomment its usage?

* Would it be worth moving str_tolower(localized_name)
from seq_search_localized() into cache_locale_time()?

[1]
https://www.postgresql.org/message-id/014866c8-d7ff-2a4f-c185-cf7e3ceb7028%402ndquadrant.com

Regards,

Juan José Santamaría Flecha

Attachment Content-Type Size
0001-Allow-localized-month-names-to_date-v8.patch application/octet-stream 11.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-01-24 10:04:25 Re: Preserve versions of initdb-created collations in pg_upgrade
Previous Message Konstantin Knizhnik 2020-01-24 09:43:10 Re: [Proposal] Global temporary tables