Re: Creating large database of MD5 hash values

From: "Jon Stewart" <jonathan(dot)l(dot)stewart(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Creating large database of MD5 hash values
Date: 2008-04-11 15:28:55
Message-ID: c0787d8a0804110828w52a0f614v6130fca72b85956c@mail.gmail.com
Lists: pgsql-performance

> > 1. Which datatype should I use to represent the hash value? UUIDs are
> > also 16 bytes...
>
> BYTEA is slower to load and a bit inconvenient to use from DBI, but
> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
> bytes with PostgreSQL 8.3).

Can you clarify the "slower to load" point? Where is that pain point
in the postgres architecture?

Storing the values in binary makes intuitive sense to me since the
data is twice as dense, thus getting you more bang for the buck on
comparisons, caching, and streaming reads. I'm not too concerned about
raw convenience, as there's not going to be a lot of code around my
application. I haven't built the thing yet so it's hard to say what
performance will be like, but for the users the difference between an
8-hour query that can run overnight and a 16-hour query that they must
wait on is significant.
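
For what it's worth, a quick way to see the "twice as dense" point in
psql, using the MD5 of the empty string as an example (32 bytes as hex
text vs. 16 as raw bytes, before per-value overhead):

SELECT octet_length('d41d8cd98f00b204e9800998ecf8427e'::text)
           AS hex_bytes,     -- 32
       octet_length(decode('d41d8cd98f00b204e9800998ecf8427e', 'hex'))
           AS binary_bytes;  -- 16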

> > 2. Does it make sense to denormalize the hash set relationships?
>
> That depends entirely on your application.

General schema would be as follows:

HASH_VALUES
  datatype md5;
  bigint id;

SET_LINK
  bigint hash_value_id;
  integer hash_set_id;

HASH_SETS
  integer id;
  varchar name;
  // other data here

The idea is that you have named sets of hash values, and hash values
can be in multiple sets.
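
Spelled out as rough DDL, it would look something like the following
(types are provisional; in particular the md5 column is shown as BYTEA
only because that's the option under discussion in question 1):

CREATE TABLE hash_sets (
    id    integer PRIMARY KEY,
    name  varchar NOT NULL
    -- other data here
);

CREATE TABLE hash_values (
    id    bigint PRIMARY KEY,
    md5   bytea NOT NULL    -- or uuid, per question 1
);

CREATE TABLE set_link (
    hash_value_id  bigint  NOT NULL REFERENCES hash_values (id),
    hash_set_id    integer NOT NULL REFERENCES hash_sets (id),
    PRIMARY KEY (hash_value_id, hash_set_id)
);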

The big operations will be to calculate the unions, intersections, and
differences between sets. That is, I'll load a new set into the
database and then see whether it has anything in common with another
set (probably throw the results into a temp table and then dump it
out). I will also periodically run queries to determine the size of
the intersection of two sets for all pairs of sets (in order to
generate some nice graphs).
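
Concretely, I'm picturing queries along these lines against the sketch
above (set ids hard-coded just for illustration):

-- Intersection of sets 1 and 2, dumped into a temp table:
CREATE TEMP TABLE common_hashes AS
SELECT a.hash_value_id
FROM   set_link a
JOIN   set_link b ON b.hash_value_id = a.hash_value_id
WHERE  a.hash_set_id = 1
  AND  b.hash_set_id = 2;

-- Difference: hashes in set 1 but not in set 2:
SELECT hash_value_id FROM set_link WHERE hash_set_id = 1
EXCEPT
SELECT hash_value_id FROM set_link WHERE hash_set_id = 2;

-- Intersection sizes for all pairs of sets, for the graphs:
SELECT a.hash_set_id AS set_a,
       b.hash_set_id AS set_b,
       count(*)      AS n_common
FROM   set_link a
JOIN   set_link b ON b.hash_value_id = a.hash_value_id
WHERE  a.hash_set_id < b.hash_set_id
GROUP  BY a.hash_set_id, b.hash_set_id;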

The number of sets could grow into the thousands, but will start
small. One of the sets I expect to be very large (it could account for
50%-90% of all hashes); the others will all be smaller, ranging in
size from 10,000 to 1,000,000. The total number of hashes could get
into the hundreds of millions, possibly billions.

One problem I'm worried about is the lack of concurrency in the
application. It will be relatively rare for more than one query to be
in flight at a time; this is not a high-connection application. It
doesn't sound like I'd get much of a performance improvement out of
postgres by throwing more cores at the problem (beyond dual-core;
always nice to have a spare core handling everything else).

Thanks very much for the comments from all. It's a pretty simple
application conceptually, just one at a large scale. Other approaches
(map-reduce-ish or straight-ahead turnkey storage) could potentially
provide better performance, but the users feel more comfortable
maintaining databases, and the overall convenience of a database over
other systems is nice.

Jon
