Re: UUID performance as primary key

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Navkirat Singh <navkirats(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: UUID performance as primary key
Date: 2010-10-16 19:35:07
Message-ID: AANLkTi=7TtwViYBBQru9p8HbLFBMrkdGY9VyKgh8rGek@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, Oct 15, 2010 at 10:59 PM, Craig Ringer
<craig(at)postnewspapers(dot)com(dot)au> wrote:
> On 16/10/2010 9:58 AM, Navkirat Singh wrote:
>>
>> Hi Guys,
>>
>> I am interested in finding out the pros/cons of using UUID as a primary
>> key field. My requirement states that UUID would be perfect in my case as I
>> will be having many small databases which will link up to a global database
>> using the UUID. Hence, the need for a unique key across all databases. It
>> would be extremely helpful if someone could help me figure this out, as it
>> is critical for my project.
>
> Pro: No need for (serverid,serverseq) pair primary keys or hacks with
> modulus based key generation. Doesn't set any pre-determined limit on how
> many servers/databases may be in a cluster.
>
> Con: Slower than modulo key generation approach, uses more storage. Foreign
> key relationships may be slower too.
>
> Overall, UUIDs seem to be a favoured approach. The other way people seem to
> do this is by assigning a unique instance id to each server/database out of
> a maximum "n" instances decided at setup time. Every key generation sequence
> increments by "n" whenever it generates a key, with an offset of the
> server/database id. That way, if n=100, server 1 will generate primary keys
> 001, 101, 201, 301, ..., server 2 will generate primary keys 002, 102, 202,
> 302, ... and so on.
>
> That works great until you need more than 100 instances, at which point
> you're really, REALLY boned. In really busy systems it also limits the total
> amount of primary key space - but with BIGINT primary keys, that's unlikely
> to be something you need to worry about.
>
> The composite primary key (serverid,sequenceid) approach avoids the need for
> a pre-defined maximum number of servers, but can be slow to index and can
> require more storage, especially because of tuple headers.
>
> I have no firsthand experience with any of these approaches so I can't offer
> you a considered opinion. I know that the MS-SQL crowd at least strongly
> prefer UUIDs, but they have very strong in-database UUID support. MySQL
> folks seem to mostly favour the modulo primary key generation approach. I
> don't see much discussion of the issue here - I get the impression Pg
> doesn't see heavy use in sharded environments.

I think your analysis is right on the money except for one thing: the
composite approach doesn't need server_id as part of the key and could
be left off the index. In fact, it can be left off the table
completely since the value is static for the entire database. You
obviously can't check RI between databases so storing the value
everywhere is of no value. server_id only matters when comparing data
from one database to another, which will rarely happen inside a
particular client database (and if it does, you'd have to store the
foreign server_id).

Any 'master' database that did control operations would of course have
to store server_id for each row but I suspect that's not where the
bulk of the data would be. Ditto any application code...it would have
to do something like this:

select server_id(), foo_id from foo where ..

server_id() is of course immutable function. Since you are not
managing 2 billion+ servers, this will be an 'int', or even a
smallint. I think this approach is stronger than UUID approach in
every way. Even stronger would be to not use surrogate keys at all,
but involve what ever makes the decision that routes data between
databases as part of a more natural key (no way to know for sure if
this works for OP w/available info).

I personally dislike sequence hacks.

merlin

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2010-10-16 19:54:14 Re: Stored procedure declared as VOLATILE => no good optimization is done
Previous Message Kenneth Marshall 2010-10-16 18:44:43 Re: Select count(*), the sequel