Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Joel Jacobson <joel(at)trustly(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Joerg Sonnenberger <joerg(at)bec(dot)de>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2019-03-20 19:55:56
Message-ID: CAASwCXdc1Yh10=gS1JjtF=Yn-zRsTG01k0Yw3M2xVrs7K5zq6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Many thanks for working on this, amazing work, really nice you made it a
separate reusable Perl-module.

The generated hash functions reads one character at a time.
I've seen a performance trick in other hash functions [1]
to instead read multiple bytes in each iteration,
and then handle the remaining bytes after the loop.

[1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29

I've done some testing and it looks like a ~30% speed-up of the generated
ScanKeywords_hash_func() function would be possible.

If you think this approach is promising, I would be happy to prepare a
patch for it,
but I wanted to check with the project this idea has not already been
considered and ruled out
for some technical reasons I've failed to see, is there any?

For this to work you would need to use larger constants for $hash_mult1
and $hash_mult2 though.
I've successfully used these values:
$hash_mult1 0x2c1b3c6d
$hash_mult2 (0x297a2d39, 0x85ebca6b, 0xc2b2ae35, 0x7feb352d, 0x846ca68b)

Here is the idea:

Generated C-code:

for (; keylen >= 4; keylen -= 4, k += 4)
{
uint32_t v;
memcpy(&v, k, 4);
v |= 0x20202020;
a = a * 739982445 + v;
b = b * 2246822507 + v;
}
uint32_t v = 0;
switch (keylen)
{
case 3:
memcpy(&v, k, 3);
v |= 0x202020;
break;
case 2:
memcpy(&v, k, 2);
v |= 0x2020;
break;
case 1:
memcpy(&v, k, 1);
v |= 0x20;
break;
}
a = a * 739982445 + v;
b = b * 2246822507 + v;
return h[a % 883] + h[b % 883];

(Reding 8 bytes a time instead would perhaps be a win since some keywords
are quite long.)

Perl-code:

sub _calc_hash
{
my ($key, $mult, $seed) = @_;

my $result = $seed;
my $i=0;
my $keylen = length($key);

for (; $keylen>=4; $keylen-=4, $i+=4) {
my $cn = (ord(substr($key,$i+0,1)) << 0)
| (ord(substr($key,$i+1,1)) << 8)
| (ord(substr($key,$i+2,1)) << 16)
| (ord(substr($key,$i+3,1)) << 24);
$cn |= 0x20202020 if $case_fold;
$result = ($result * $mult + $cn) % 4294967296;
}

my $cn = 0;
if ($keylen == 3) {
$cn = (ord(substr($key,$i+0,1)) << 0)
| (ord(substr($key,$i+1,1)) << 8)
| (ord(substr($key,$i+2,1)) << 16);
$cn |= 0x202020 if $case_fold;
} elsif ($keylen == 2) {
$cn = (ord(substr($key,$i+0,1)) << 0)
| (ord(substr($key,$i+1,1)) << 8);
$cn |= 0x2020 if $case_fold;
} elsif ($keylen == 1) {
$cn = (ord(substr($key,$i+0,1)) << 0);
$cn |= 0x20 if $case_fold;
}
$result = ($result * $mult + $cn) % 4294967296;

return $result;
}

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message legrand legrand 2019-03-20 20:05:06 Re: [survey] New "Stable" QueryId based on normalized query text
Previous Message Alvaro Herrera 2019-03-20 19:53:44 Re: PostgreSQL pollutes the file system