Re: UTF8 problem

From: Stephane Bortzmeyer <bortzmeyer(at)nic(dot)fr>
To: Douglas McNaught <doug(at)mcnaught(dot)org>
Cc: Tino Wildenhain <tino(at)wildenhain(dot)de>, Alban Hertroys <alban(at)magproductions(dot)nl>, "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: UTF8 problem
Date: 2006-06-15 14:34:30
Message-ID: 20060615143430.GA17590@nic.fr
Lists: pgsql-general

On Thu, Jun 08, 2006 at 07:25:35AM -0400,
Douglas McNaught <doug(at)mcnaught(dot)org> wrote
a message of 29 lines which said:

> I would think it would (at least potentially) vary with each
> message. The dbmail software should really set client_encoding
> based on the Content-Transfer-Encoding header in the message (or
> whatever it's called).

A *big* warning from someone who stores email in PostgreSQL: many
email messages *lie*. They declare one encoding (the charset parameter
of the Content-Type header) and then actually use another.

If you blindly try to inject the body of the message into PostgreSQL
with the declared encoding, you will sometimes fail, for instance if
the message claims to be in UTF-8 but is not (something that PostgreSQL
will detect).

You can either:

* "sanitize" all incoming data,
* or accept that these invalid emails will be rejected,
* or store them in an unstructured field (a blob).
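
To make the three options concrete, here is a minimal sketch in Python
of checking a body against its declared charset before it ever reaches
the database. The function names (classify_body, sanitize_body) are
hypothetical and are not taken from dbmail or any other software; the
fallback to UTF-8 for unknown charset names is also just an assumption.

```python
from typing import Optional, Tuple


def classify_body(raw: bytes, declared_charset: str) -> Tuple[str, Optional[str]]:
    """Return ("text", decoded) if raw is valid in declared_charset,
    otherwise ("blob", None) so the caller can reject the message or
    store the raw bytes in a binary column instead."""
    try:
        return "text", raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in the declared charset.
        # LookupError: Python does not know the declared charset name.
        return "blob", None


def sanitize_body(raw: bytes, declared_charset: str) -> str:
    """Decode leniently: invalid bytes become U+FFFD instead of raising."""
    try:
        return raw.decode(declared_charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to UTF-8 (an assumption, see above).
        return raw.decode("utf-8", errors="replace")
```

With this, classify_body covers the "reject" and "blob" options (the
caller decides which), while sanitize_body covers the "sanitize" option
at the cost of silently altering the message.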
