Re: Fixed length data types issue

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fixed length data types issue
Date: 2006-09-07 14:38:10
Message-ID: 87k64fbv3h.fsf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:

> On Thu, Sep 07, 2006 at 01:27:01PM +0100, Gregory Stark wrote:
>> ... If you look again at the columns in my example you'll
>> see none of them are appropriate targets for i18n anyways. They're all codes
>> and even numbers.
>
> Which begs the question of why they don't store the numbers in numeric
> columns? That'll take far less space than any string.

Sure, unless by PRVDR_NUM they mean things like '000001' or even 'C00001'.

...

> It makes sense to store them as numbers, or perhaps an enum.

Sure. If you're designing the schema from scratch and don't have to
interoperate with any other systems. But if you're receiving a flat ascii text
file and it has a 5 character opaque identifier called "FI_NUM" which do you
think is the safer approach for storing these opaque identifiers?

You can suggest that there are other ways of designing a schema that will work
better with Postgres but I think you're just working around Postgre
deficiencies.

These may be deficiencies that are pretty low down your priority list but they
may be higher up my list. I just don't think you can argue they're not
deficiencies just because you know better than to get tripped up by them. I'm
discussing these things with an eye to getting some kind of consensus on what
should be done about them so I can go do it, not because I'm trying to get you
to work on it :)

>> Currently Postgres has a limitation that you can only have one encoding per
>> database and one locale per cluster. Personally I'm of the opinion that the
>> only correct choice for that is "C" and all localization should be handled in
>> the client and with pg_strxfrm. Putting the whole database into non-C locales
>> guarantees that the columns that should not be localized will have broken
>> semantics and there's no way to work around things in the other direction.
>
> Quite. So if someone would code up SQL COLLATE support and integrate
> ICU, everyone would be happy and we could all go home.

Well I for one would be pretty unhappy if ICU were integrated. It seems like a
whole pile of code and complexity for no particular gain. The standard i18n
support with a few extensions (namely strcoll_l) seems to be adequate for us
and not introduce huge new dependencies and code burdens.

> BTW, requireing localisation to happen in the client is silly. SQL
> provides the ORDER BY clause for strings and it'd be silly to have the
> client resort them just because they're not using C locale. The point
> of a database was to make your life easier, right?

That's why I mentioned pg_strxfrm. It doesn't solve all your problems if
you're doing lots of string manipulations in queries but it can handle
collation so you can at least execute ORDER BY clauses which of course you
can't efficiently do in the client. For anything more complex you're probably
happier doing your string manipulations in the client just because SQL's
string primitives are so, well, primitive.

>> Perhaps given the current situation what we should have is a cvarchar and
>> cchar data types that are like varchar and char but guaranteed to always be
>> interpreted in the c locale with ascii encoding.
>
> I think bytea gives you that, pretty much.

Hm, that's an interesting idea. We could define all the string functions for
bytea as well.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2006-09-07 14:56:45 Re: Fixed length data types issue
Previous Message Dave Cramer 2006-09-07 14:34:15 Re: getting access to gborg, specifically the jdbc CVS files