Re: MD5 aggregate

From: Marko Kreen <markokr(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-17 11:53:30
Message-ID: 20130617115330.GA21009@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> > There seem to be 2 separate directions that this could go, which
> > really meet different requirements:
> >
> > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > the order in which they are scanned. A possible approach to this might
> > be something like an aggregate
> >
> > md5_total(text/bytea) returns text
> >
> > that returns the sum of the md5 values of each input value, treating
> > each md5 value as an unsigned 128-bit integer, and then producing the
> > hexadecimal representation of the final sum. This should out-perform a
> > solution based on numeric addition, and in typical cases, the result
> > wouldn't be much longer than a regular md5 sum, and so would be easy
> > to eyeball for differences.
> >
>
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
>
> SELECT md5_agg(foo.*::text)
> FROM (SELECT * FROM foo ORDER BY id) foo;
>
> 50bc42127fb9b028c9708248f835ed8f
>
> Time: 92960.021 ms
>
> SELECT md5_total(foo.*::text) FROM foo;
>
> 02faea7fafee4d253fc94cfae031afc43c03479c
>
> Time: 96190.343 ms
>
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.
>
> Note, however, that if there is an index that can be used for
> ordering, the performance is not necessarily better than md5_agg(), as
> this example shows. There is a small additional overhead per row for
> initialising the MD5 sums, and adding the results to the total, but I
> think the biggest factor is that md5_total() is processing more data.
> The reason is that MD5 works on 64-byte blocks, so the total amount of
> data going through the core MD5 algorithm is each row's size is
> rounded up to a multiple of 64. In this simple case it ends up
> processing around 1.5 times as much data:
>
> SELECT sum(length(foo.*::text)) AS md5_agg,
> sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>
> md5_agg | md5_total
> ------------+-------------
> 8103815438 | 12799909248
>
> although of course that overhead won't be as large on wider tables,
> and even in this case the overall performance is still on a par with
> md5_agg().
>
> ISTM that both aggregates are potentially useful in different
> situations. I would probably typically use md5_total() because of its
> simplicity/order-independence and consistent performance, but
> md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
bigger than available RAM). If you want to promote such use you should
also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints. Just let it
happen. Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
or SpookyHash. You can get C implementation from pghashlib.

--
marko

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Szymon Guz 2013-06-17 11:59:43 Re: Add regression tests for SET xxx
Previous Message Albe Laurenz 2013-06-17 11:49:02 Review: Display number of changed rows since last analyze