Re: converting databases form SQL_ASCII to UTF8

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Geoffrey Myers <lists(at)serioustechnology(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: converting databases form SQL_ASCII to UTF8
Date: 2011-04-22 18:35:44
Message-ID: 4DB1CA80.1050103@pinpointresearch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 04/22/2011 08:00 AM, Geoffrey Myers wrote:
> We are moving our databases to new hardware soon, so we felt it would
> be a good time to get the encoding correct. Our databases are
> currently SQL_ASCII and we plan to move them to UTF8.
We are in the same boat, fortunately only on one older server we are
upgrading and fortunately for internal apps....
>
>
> So, as previously noted, there are certain characters that won't load
> into a UTF8 database from a dump of the SQL_ASCII database.
>
> Here's our problem. We planned on moving databases a few at a time.
> Problem is, there is a process that pushes data from one database to
> another. If this process attempts to push data from a SQL_ASCII
> database to a new UTF8 database and it has one of these characters
> mentioned above, the process fails.
>
> So, now the question is, is this effort even worth our effort?
>
> What is the harm in leaving our databases SQL_ASCII encoded?

SQL_ASCII is a synonym for "no encoding." You put in a stream of bytes
and that's what you get out. That's OK if the byte-stream has exactly
the same meaning to every application and user. If that's not the case
then you have bytes in your database but you don't know what those bytes
are supposed to represent.

In a way, it's like having a generic integer column but depending on the
user or the application, that column might represent a unix epoch
timestamp, an age in years, a salary, a weight in grams, furlongs per
fortnight, etc. And there is no indicator to say which it is. Not good.

We are in the final stages of cleaning up our last bit of non-utf8 data
and the above some what silly example is actually not far from the
truth. Due to data that arrived from web-inputs, spreadsheet imports,
command-line, internal-apps, etc. we have been faced with cleaning
tables where one row has only basic ASCII data, another has UTF8, while
others have various different Microsoft encodings. With data like that
it's pretty much impossible to guarantee that even a simple web-report
will display fully correctly.

The longer you wait, the worse it gets. Even though it is only a tiny
percentage of our data, cleaning it is still a pain.

Is it worth it? Dunno - you have to calculate the cost/benefit. For us
it was a no-brainer to bite-the-bullet and do it.

Cheers,
Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bob Pawley 2011-04-22 19:52:28 Re: Different views of remote server
Previous Message SUBHAM ROY 2011-04-22 18:12:55 Re: Needs Suggestion