Re: Unsigned integer types

From: Maciej Gajewski <maciej(dot)gajewski0(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unsigned integer types
Date: 2013-05-28 09:17:42
Message-ID: CAEcSYX+Arn7y4FeYPp6ZgbiiiMfZYmsn9aUyotZB-MA1n5hTOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The reasons are: performance, storage and frustration.

I think the frustration comes from the fact that unsigned integers are
universally available, except in PostgreSQL. I work with a really
complex system, with many moving parts, and Postgres really is one of
the components that causes the least trouble (compared to other
opens-source and closed-sourced systems, which I shall leave unnamed),
except for the unsigned integers.

Let me give you few examples:

1. SMALLINT

Probably the most popular unsigned short int on the planet: IP port
number. I had to store some network traffic data in DB; I
instinctively started to prototyping it like this:

CREATE TABLE packets (addr INET, port SMALLINT, ... );

Of course it failed quickly and I had to bump the size to INTEGER. No
real harm here, as the 2 bytes will probably go into some padding
anyway, but somehow it feels wrong.

2. INTEGER

I had to store a record with several uint32. I had to store an awful
lot of them; hundreds GB of data per day. Roughly half of the record
consists of uint32 fields.
Increasing the data type to bigint would mean that I could store 3
instead of 4 days worth of data on available storage.
Continuing with int4 meant that I would have to deal with the data in
special way when in enters and leaves the DB. It's easy in C: just
cast uint32_t to int32_t. But python code requires more complex
changes. And the web backend too...

It's suffering either way!

Just imagine the conversation I had to have with my boss: "Either
we'll increase budged for storage, or we need to touch every bit of
the system".

3 .BIGINT

There is no escape from bigint. Numeric (or TEXT!) is the only thing
that can keep uint64, but when you have 10^9 and more records, and you
need to do some arithmetic on it, numeric it's just too slow.

We use uint64 all across our system as unique event identifier. It
works fine, it's fast, and it's very convenient. Passing uint64
around, storing it, looking it up. We use it everywhere, including UI
and log files. So once I decided to use BIGINT to store it, I had to
guard all the inputs and outputs and make sure it is handled
correctly. Or so I though.

It turned out that some guys from different department are parsing
some logs with perl parser and they store it in DB. They choose to
store the uint64 id as TEXT. They probably tried BIGINT and failed and
decided that - since they have low volume and they are not doing any
arithmetics - to store it as TEXT.

And now someone came up with an idea to join one table with another,
bigint with text. I did it. Initially I wrote function that converted
the text to numeric, then rotated it around 2^64 if necessary. It was
too slow. Too slow for something that should be a simple
reinterpretation of data.

Eventually I ended up writing a C function, that first scanf(
"%llu")'d the text into uint64_t, and then PG_RETURN_INT64-ed the
uint64_t value. Works fast, but operations hate for increasing the
complexity of DB deployment.

---

I know some cynical people that love this kind of problems, they feel
that the constant struggle is what keeps them employed :) But I'm
ready to use my private time to solve it once and for all.

I'm afraid that implementing uints as and extension would introduce
some performance penalty (I may be wrong). I'm also afraid that with
the extension I'd be left on my own maintaining it forever. While if
this could go into the core product, it would live forever.

As for the POLA violation: programmers experienced with statically
typed languages shouldn't have problems dealing with all the issues
surrounding signed/unsigned integers (like the ones described here:
http://c-faq.com/expr/preservingrules.html). Others don't need to use
them.

Maciek

On 27 May 2013 16:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Maciej Gajewski <maciej(dot)gajewski0(at)gmail(dot)com> writes:
>> The lack of unsigned integer types is one of the biggest sources of
>> grief in my daily work with pgsql.
>> Before I go and start hacking, I'd like to discuss few points:
>> 1. Is there a strong objection against merging this kind of patch?
>
> Basically, there is zero chance this will happen unless you can find
> a way of fitting them into the numeric promotion hierarchy that doesn't
> break a lot of existing applications. We have looked at this more than
> once, if memory serves, and failed to come up with a workable design
> that didn't seem to violate the POLA.
>
>> 2. How/if should the behaviour of numeric literals change?
>
>> The minimalistic solution is: it shouldn't, literals should be assumed
>> signed by default. More complex solution could involve using C-style
>> suffix ('123456u').
>
> Well, if you don't do that, there is no need for you to merge anything:
> you can build unsigned types as an external extension if they aren't
> affecting the core parser's behavior. As long as it's external, you
> don't need to satisfy anybody else's idea of what reasonable behavior
> is ...
>
> regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Szymon Guz 2013-05-28 10:02:18 converting numeric to string in postgres code
Previous Message Craig Ringer 2013-05-28 08:38:11 Re: [PATCH] add --throttle to pgbench (submission 3)