Re: Creating large database of MD5 hash values

From: Florian Weimer <fweimer(at)bfk(dot)de>
To: "Jon Stewart" <jonathan(dot)l(dot)stewart(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Creating large database of MD5 hash values
Date: 2008-04-11 17:04:00
Message-ID: 82skxsqq67.fsf@mid.bfk.de
Lists: pgsql-performance

* Jon Stewart:

>> BYTEA is slower to load and a bit inconvenient to use from DBI, but
>> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>> bytes with PostgreSQL 8.3).

> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?

COPY FROM needs to read 2.5 bytes on average instead of 2, and a complex
form of double-decoding is necessary.
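The size difference itself is easy to verify: an MD5 digest is 16 raw
bytes, or 32 characters in hex form, which with the single-byte header
counted in the figures above gives the 17 vs. 33 bytes on disk. A quick
sketch in Python:

```python
import hashlib

# The same MD5 value in its two storage forms.
digest = hashlib.md5(b"example data").digest()        # raw binary, for BYTEA
hexdigest = hashlib.md5(b"example data").hexdigest()  # hex text, for TEXT/VARCHAR

print(len(digest))     # 16 bytes of payload -> 17 on disk with a 1-byte header
print(len(hexdigest))  # 32 bytes of payload -> 33 on disk with a 1-byte header
```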

> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.

The main issue is that you can't use the parameter-providing version
of $sth->execute (or things like $sth->selectarray or $sth->do); you
must use explicit binding by parameter index in order to specify the
type information.

> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.

The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think. So it's better to
denormalize in this case (if that's what you were alluding to in your
original post).

> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).
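The union/intersection/difference operations are plain SQL joins and set
queries. A minimal sketch, using Python's built-in sqlite3 as a
stand-in (table names are invented; the queries carry over to
PostgreSQL essentially unchanged, with bytea in place of BLOB):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE set_a (hash BLOB PRIMARY KEY)")
conn.execute("CREATE TABLE set_b (hash BLOB PRIMARY KEY)")

def md5(data):
    return hashlib.md5(data).digest()  # store the dense 16-byte binary form

conn.executemany("INSERT INTO set_a VALUES (?)",
                 [(md5(w),) for w in [b"a", b"b", b"c"]])
conn.executemany("INSERT INTO set_b VALUES (?)",
                 [(md5(w),) for w in [b"b", b"c", b"d"]])

# Intersection: hash values present in both sets.
common = conn.execute(
    "SELECT a.hash FROM set_a a JOIN set_b b ON a.hash = b.hash").fetchall()

# Difference: hash values in set_a but not in set_b.
only_a = conn.execute(
    "SELECT hash FROM set_a EXCEPT SELECT hash FROM set_b").fetchall()

print(len(common))  # 2
print(len(only_a))  # 1
```

In practice the SELECT results would go into a temp table (CREATE TEMP
TABLE ... AS SELECT ...) as described above, rather than being fetched.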

In this case, PostgreSQL's in-memory bitmap indices should give you
most of the effect of your hash <-> ID mapping anyway.

> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).

I think it's very difficult to compute that efficiently, but I haven't
thought much about it. This type of query might benefit from your
hash <-> ID mapping, however, because the working set is smaller.
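For the all-pairs intersection sizes, one self-join over a single
membership table produces every pairwise count in one pass. A hedged
sketch (the schema is invented; shown with sqlite3 for illustration,
but the GROUP BY query is the same in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (set_id INTEGER, hash BLOB)")
rows = [(1, b"h1"), (1, b"h2"), (1, b"h3"),
        (2, b"h2"), (2, b"h3"),
        (3, b"h3")]
conn.executemany("INSERT INTO members VALUES (?, ?)", rows)

# |A ∩ B| for every unordered pair of sets, in a single self-join.
pairs = conn.execute("""
    SELECT m1.set_id, m2.set_id, COUNT(*)
    FROM members m1
    JOIN members m2 ON m1.hash = m2.hash AND m1.set_id < m2.set_id
    GROUP BY m1.set_id, m2.set_id
    ORDER BY m1.set_id, m2.set_id
""").fetchall()

print(pairs)  # [(1, 2, 2), (1, 3, 1), (2, 3, 1)]
```

This is where a compact hash <-> ID mapping pays off: the self-join's
working set shrinks to the integer IDs.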

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
