Re: bytea encode performance issues

From: Sim Zacks <sim(at)compulab(dot)co(dot)il>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: bytea encode performance issues
Date: 2008-08-07 14:40:14
Message-ID: 489B094E.2090200@compulab.co.il
Lists: pgsql-general

Merlin,

You are suggesting that we fight the flexible nature of email by
forcing it into a UTF-8 shell. That doesn't always work.

I suggest you read the PostgreSQL documentation's definition of SQL_ASCII:
> The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting, because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.

It says that in most cases it is unwise to use SQL_ASCII if you are
working with non-ASCII data. That is because most situations do not have
to accept multiple encodings. However, email is a special case: the
recipient has no control over what is being sent. It is therefore
possible (and it happens to us) to receive emails that cannot be
converted to UTF-8.
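
To illustrate the kind of client-side handling this implies, here is a
minimal Python sketch, not our actual code. The function name
decode_email_body and the candidate list are hypothetical; it assumes
the only encodings worth trying are UTF-8 and windows-1255 (Python's
"cp1255" codec), and it hands back the raw bytes untouched when neither
one fits, so nothing is lost or silently mangled.

```python
# Hypothetical helper: try a short list of encodings in order and
# report which one worked. If none works, return None for the text so
# the caller can store the original bytes instead (e.g. in a bytea column).

CANDIDATE_ENCODINGS = ("utf-8", "cp1255")  # assumption: only these two


def decode_email_body(raw):
    """Return (text, encoding_name), or (None, None) if nothing fits."""
    for name in CANDIDATE_ENCODINGS:
        try:
            # errors="strict" is the default: invalid bytes raise
            # UnicodeDecodeError instead of being replaced.
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


if __name__ == "__main__":
    shalom = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, written as escapes
    for body in (shalom.encode("utf-8"), shalom.encode("cp1255")):
        text, used = decode_email_body(body)
        print(used, text == shalom)
```

Trying UTF-8 first matters here: windows-1255 Hebrew letters are single
bytes that are not valid UTF-8 on their own, so the strict UTF-8 attempt
fails and the fallback is used. The reverse order would wrongly accept
most UTF-8 input as windows-1255.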

The only way I could convert from MySQL, which does not check encoding,
to a PostgreSQL UTF-8 database was to use an SQL_ASCII database as a
bridge. Because SQL_ASCII does not check the encoding, I could load the
data there, move it into a bytea column, and then take a backup of that
database and restore it into a UTF-8 database.
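
Before a restore like that, it can help to know which rows a UTF-8
database would reject as text. The following is only a rough Python
sketch of such a pre-flight check; the function name
split_by_utf8_validity and the (row_id, raw_bytes) input shape are
hypothetical, and PostgreSQL's own validation may differ in edge cases
(for example, it does not accept NUL bytes in text values).

```python
# Hypothetical pre-flight check: partition rows by whether their bytes
# form valid UTF-8. Rows in the "invalid" list would have to stay in a
# binary column (bytea) rather than be stored as text.


def split_by_utf8_validity(rows):
    """rows: iterable of (row_id, raw_bytes). Returns (valid_ids, invalid_ids)."""
    valid_ids, invalid_ids = [], []
    for row_id, raw in rows:
        try:
            raw.decode("utf-8")  # strict decoding; raises on invalid bytes
        except UnicodeDecodeError:
            invalid_ids.append(row_id)
        else:
            valid_ids.append(row_id)
    return valid_ids, invalid_ids


if __name__ == "__main__":
    sample = [
        (1, b"plain ascii"),
        (2, "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")),
        (3, "\u05e9\u05dc\u05d5\u05dd".encode("cp1255")),
    ]
    print(split_by_utf8_validity(sample))
```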

Sim

Merlin Moncure wrote:
> On Thu, Aug 7, 2008 at 9:38 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Thu, Aug 7, 2008 at 1:16 AM, Sim Zacks <sim(at)compulab(dot)co(dot)il> wrote:
>>>> I don't quite follow that...the whole point of utf8 encoded database
>>>> is so that you can use text functions and operators without the bytea
>>>> treatment. As long as your client encoding is set up properly (so
>>>> that data coming in and out is computed to utf8), then you should be
>>>> ok. Dropping to ascii is usually not the solution. Your data
>>>> inputting application should set the client encoding properly and
>>>> coerce data into the unicode text type...it's really the only
>>>> solution.
>>>>
>>> Email does not always follow a specific character set. I have tried
>>> converting the data that comes in to utf-8 and it does not always work.
>>> We receive Hebrew emails which come in mostly 2 flavors, UTF-8 and
>>> windows-1255. Unfortunately, they are not compatible with one another.
>>> SQL-ASCII and ASCII are different as someone on the list pointed out to
>>> me. According to the documentation, SQL-ASCII makes no assumption about
>>> encoding, so you can throw in any encoding you want.
>> no, you can't! SQL-ASCII means that the database treats everything
>> like ascii. This means that any operation that deals with text could
>> (and in the case of Hebrew, almost certainly will) be broken. Simple
>> things like getting the length of a string will be wrong. If you are
>> accepting unicode input, you absolutely must be using a unicode
>> encoded backend.
>
> er, I see the problem (single piece of text with multiple encodings
> inside) :-). ok, it's more complicated than I thought. still, you
> need to convert the email to utf8. There simply must be a way,
> otherwise your emails are not well defined. This is a client side
> problem...if you push it to the server in ascii, you can't use any
> server side text operations reliably.
>
> merlin
