Re: [PATCHES] updated hash functions for postgresql v1

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2009-01-26 04:27:03
Message-ID: 20090126042703.GA14738@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > I ran 5 times on both old and new code, eliminating the top and bottom
> > and taking the average of the remaining 3, and I got a 6.9% performance
> > improvement with the new code.
>
> The question that has been carefully evaded throughout the discussion
> of this patch is whether the randomness of the hash result is decreased,
> and if so what is that likely to cost us in performance of the various
> hash-dependent algorithms. I would still like to have an answer to that
> before we make a change to gain marginal performance improvement in
> the hash function itself (which is already shown to be barely measurable
> in the total context of a hash-dependent operation...)
>
> regards, tom lane
>

Dear Hackers and Reviewers,

In response to Tom's questioning the randomness of the hash_any
when the mixing functions are split into two, the first used when
sequentially processing the input and the second for the final
mix, I have generated a more detailed analysis of the two hash_any
functions.

First, one change to the 11/2008 patch, the keylen is added to a, b and
c initially so we do not need to add it later on. The is the
micro-diff:
-----------------------------

--- hashfunc.c_TWOMIX 2009-01-22 14:07:34.000000000 -0600
+++ hashfunc.c_TWOMIX2 2009-01-22 14:17:32.000000000 -0600
@@ -336,7 +336,6 @@

/* handle the last 11 bytes */
k = (const unsigned char *) ka;
- c += keylen;
#ifdef WORDS_BIGENDIAN
switch (len)
{
@@ -439,7 +438,6 @@
}

/* handle the last 11 bytes */
- c += keylen;
#ifdef WORDS_BIGENDIAN
switch (len) /* all the case statements fall through */

-----------------------------

The remainder of this document will use the runs from my initial results
broken out using various power-of-two bucket sizes to simulate our actual
use in PostgreSQL as the number of items hashed increases and we use more
and more bits of our hash to identify the appropriate bucket. I have run
each test twice, once with our current hash_any() with the single mix()
function and then a second time using my patch from the November commitfest
plus the patch above to produce a new hash_any() with two separate mixing
functions mix() and final(). For each run I have generated a sequence of
unique inputs, up to the number of buckets, hashed them with the hash
functions (both old and new), then I calculate the expected number of
collision p(n) using the poisson formula for each number of buckets,
where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and
2**26. For my initial run, I used a string consisting of the letter 'a'
followed by the integer representation of the numbers from 0 to the
(number of buckets - 1):

1) "a"uint32 ((i.e. a00001,a0002...)
Number of buckets: 65536
Total number of items: 65536
Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
Actual number mix(): 24044 24172 12078 4036 980 186 30 10
Actual number mix()/final(): 24027 24232 12060 3972 1001 207 31 5 1

Number of buckets: 262144
Total number of items: 262144
Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
Actual number mix(): 96224 96730 48240 15951 4094 744 143 17 1
Actual number mix()/final(): 96335 96646 48071 16128 4018 801 122 21 2

Number of buckets: 1048576
Total number of items: 1048576
Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
Actual number mix(): 385716 385596 193243 64115 16053 3285 478 77 12 1
Actual number mix()/final(): 385955 385016 193789 63768 16259 3190 511 79 8 1

Number of buckets: 4194304
Total number of items: 4194304
Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
Actual number mix(): 1542536 1543489 771351 257777 63830 12847 2123 326 19 5 1
Actual number mix()/final(): 1541828 1544429 772072 256178 64579 12774 2129 288 22 5

Number of buckets: 16777216
Total number of items: 16777216
Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
Actual number mix(): 6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
Actual number mix()/final(): 6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3

Number of buckets: 67108864
Total number of items: 67108864
Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
Actual number mix(): 24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
Actual number mix()/final(): 24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5

Here is a second run with number of items = (number of buckets)/2:
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix(): 39704 19978 4898 842 103 10 1
Actual number mix()/final(): 39715 19922 4991 779 119 9 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix(): 158753 79972 19703 3216 458 38 4
Actual number mix()/final(): 158869 79688 19853 3303 393 33 3 2

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix(): 636118 317839 79388 13435 1628 152 16
Actual number mix()/final(): 636102 317824 79527 13267 1683 161 12

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix(): 2544529 1270639 318904 53005 6516 645 61 5
Actual number mix()/final(): 2544014 1271483 318720 52849 6559 630 47 2

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix(): 10174997 5089736 1271355 211528 26679 2683 220 17 1
Actual number mix()/final(): 10176567 5087207 1271789 212014 26684 2703 235 16 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix(): 40703033 20353345 5086252 848860 105885 10536 891 59 3
Actual number mix()/final(): 40770874 20250828 5096890 865448 111835 11890 1013 76 10

2) uint32uint32 (i.e. uint64)
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix(): 39770 19863 4947 822 125 9
Actual number mix()/final(): 39754 19852 4990 837 91 11 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix(): 159046 79464 19823 3334 430 43 3 1
Actual number mix()/final(): 158879 79732 19774 3301 406 47 5

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix(): 636275 317580 79514 13357 1661 172 14 3
Actual number mix()/final(): 635578 318574 79504 13162 1585 159 13 1

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix(): 2543769 1272196 318067 53050 6497 672 48 4 1
Actual number mix()/final(): 2543014 1273485 317812 52703 6589 635 59 7

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix(): 10174587 5090467 1270909 211836 26513 2679 207 18
Actual number mix()/final(): 10175142 5089481 1270919 212555 26234 2643 224 15 3

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix(): 40706281 20347746 5088639 848133 106388 10669 946 60 2
Actual number mix()/final(): 40701438 20354677 5088549 846570 106157 10578 838 55 2

3) uint32 from 1-1648379
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual number mix(): 39758 19844 4993 835 97 9
Actual number mix()/final(): 39842 19712 5016 849 109 7 1

Number of buckets: 262144
Total number of items: 131072
Expected number with n items: 158998 79499 19874 3312 414 41 3
Actual number mix(): 158973 79523 19900 3283 426 38 1
Actual number mix()/final(): 159136 79293 19896 3346 421 48 3 1

Number of buckets: 1048576
Total number of items: 524288
Expected number with n items: 635993 317996 79499 13249 1656 165 13
Actual number mix(): 636083 317917 79484 13183 1711 180 16 2
Actual number mix()/final(): 636183 317772 79438 13305 1685 173 20

Number of buckets: 4194304
Total number of items: 2097152
Expected number with n items: 2543973 1271986 317996 52999 6624 662 55 3
Actual number mix(): 2544325 1271367 318260 52950 6660 683 54 4 1
Actual number mix()/final(): 2544022 1272059 317778 53060 6646 665 70 4

Number of buckets: 16777216
Total number of items: 8388608
Expected number with n items: 10175895 5087947 1271986 211997 26499 2649 220 15
Actual number mix(): 10176360 5087249 1272083 212010 26655 2629 212 18
Actual number mix()/final(): 10175315 5088617 1272068 212144 26224 2589 234 23 1 1

Number of buckets: 67108864
Total number of items: 33554432
Expected number with n items: 40703583 20351791 5087947 847991 105998 10599 883 63 3
Actual number mix(): 40704034 20350542 5088760 848309 105721 10520 894 77 7
Actual number mix()/final(): 40774785 20243786 5098862 866801 111865 11641 1019 100 5

4) cracklib
Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix(): 39731 19858 5049 794 92 11 1
Actual number mix()/final(): 39785 19801 5011 823 106 9 1

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix(): 159077 79420 19806 3380 409 49 3
Actual number mix()/final(): 159020 79475 19845 3366 383 54 1

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix(): 635979 318066 79420 13281 1641 163 23 3
Actual number mix()/final(): 635694 318637 79122 13271 1686 150 13 3

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix(): 2830769 1113458 218633 28396 2786 250 11 1
Actual number mix()/final(): 2831304 1113004 218002 28843 2926 212 13

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix(): 15206923 1494718 73129 2384 59 3
Actual number mix()/final(): 15207183 1494267 73250 2452 64

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix(): 65480389 1608727 19593 154 1
Actual number mix()/final(): 65480455 1608609 19631 168 1

5) cracklib x 100
Number of buckets: 65536
Total number of items: 32767
Expected number with n items: 39750 19874 4968 828 103 10
Actual number mix(): 39829 19742 5003 847 99 14 2
Actual number mix()/final(): 39783 19794 5023 828 97 11

Number of buckets: 262144
Total number of items: 131071
Expected number with n items: 158998 79498 19874 3312 414 41 3
Actual number mix(): 159136 79242 20007 3289 405 62 3
Actual number mix()/final(): 158961 79546 19905 3270 408 51 3

Number of buckets: 1048576
Total number of items: 524287
Expected number with n items: 635994 317996 79498 13249 1656 165 13
Actual number mix(): 635815 318331 79372 13218 1658 168 12 2
Actual number mix()/final(): 635986 318152 79309 13231 1687 190 21

Number of buckets: 4194304
Total number of items: 1648379
Expected number with n items: 2831263 1112698 218647 28643 2814 221 14
Actual number mix(): 2830891 1113468 218202 28712 2804 208 18 1
Actual number mix()/final(): 2831786 1111616 219209 28661 2816 199 16 1

Number of buckets: 16777216
Total number of items: 1648379
Expected number with n items: 15207226 1494125 73399 2403 59 1
Actual number mix(): 15207428 1493777 73492 2459 59 1
Actual number mix()/final(): 15207777 1493063 73877 2435 63 1

Number of buckets: 67108864
Total number of items: 1648379
Expected number with n items: 65480564 1608383 19753 161 1.47989202515749e-08
Actual number mix(): 65480463 1608595 19635 170 1
Actual number mix()/final(): 65480650 1608222 19820 171 1

As you can see, all of the probabilities for both the current single mix()
function and the split mix()/final() functions match the theoretical results
as calculated by the poisson formula. Here is a nice reference that I found
online describing the testing procedure that I am using:

http://rabbit.eng.miami.edu/class/een318/poisson.pdf

I can find no situations where the current version of hash_any() performs
statistically better than the version resulting from my split mix()/final()
function patch from 11/2008. I would be happy to share the test harness with
anyone who wishes to test further. Please let me know if there is anything
I can do. I still recommend applying the split mixing function patch that
I submitted for the 11/2008 commitfest and I think these tests should allay
Tom's concerns about the randomness of the new hash function.

Regards,
Ken Marshall

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message KaiGai Kohei 2009-01-26 05:00:10 Re: 8.4 release planning (was Re: [COMMITTERS] pgsql: Automatic view update rules)
Previous Message ITAGAKI Takahiro 2009-01-26 04:14:00 Compiler warnings fix