Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, John Naylor <jcnaylor(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2018-12-28 08:54:49
Message-ID: 20181228085449.v2veeelcdpstmlsq@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-12-27 14:22:11 -0500, Tom Lane wrote:
> Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
> > A smaller project might be to see if we can replace the binary keyword
> > search in ScanKeyword with a perfect hashing function generated by
> > gperf, or something similar. I had a quick look at that, too.
>
> Yeah, we've looked at gperf before, eg
>
> https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de
>
> Perhaps it'd be a win but I'm not very convinced.

Note that the tradeoffs mentioned there, going from memory, aren't
necessarily applicable here. As we're dealing with strings anyway,
gperf wanting to deal with strings rather than with numbers isn't a
problem.

> I don't know much about the theory of perfect hashing, but I wonder
> if we could just roll our own tool for that. Since we're not dealing
> with extremely large keyword sets, perhaps brute force search for a
> set of multipliers for a hash computation like
> (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size
> would work.
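
(For concreteness, a brute-force search along those lines might look
roughly like the sketch below - the table size, the attempt budget and
the cyclic per-position multipliers are all illustrative assumptions,
not a concrete proposal.)

/* Hypothetical brute-force search for per-position multipliers such
 * that hash(keyword) mod TABLE_SIZE is collision-free over the
 * keyword list.  All names and constants here are illustrative. */
#include <stdbool.h>
#include <stdlib.h>

#define TABLE_SIZE 512			/* comfortably above the keyword count */
#define NMULT 8					/* positions get multipliers cyclically */

static unsigned int
multiplier_hash(const char *key, const unsigned int *mult)
{
	unsigned int h = 0;

	for (int i = 0; key[i]; i++)
		h += (unsigned char) key[i] * mult[i % NMULT];
	return h % TABLE_SIZE;
}

/* try random multiplier sets until one maps all keys to distinct slots */
static bool
find_multipliers(const char **keys, int nkeys, unsigned int *mult)
{
	for (int attempt = 0; attempt < 1000000; attempt++)
	{
		bool	used[TABLE_SIZE] = {false};
		bool	ok = true;

		for (int i = 0; i < NMULT; i++)
			mult[i] = ((unsigned int) rand() << 1) | 1;	/* odd, nonzero */

		for (int k = 0; ok && k < nkeys; k++)
		{
			unsigned int slot = multiplier_hash(keys[k], mult);

			if (used[slot])
				ok = false;
			else
				used[slot] = true;
		}
		if (ok)
			return true;
	}
	return false;				/* give up; caller could grow TABLE_SIZE */
}

Whether a random search like this terminates quickly for a keyword set
of a few hundred entries depends a lot on TABLE_SIZE; that's the sort
of tuning gperf automates.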

The usual way to do perfect hashing is basically a two-stage
hashtable: the first stage is keyed by a "normal" hash function, and
the second stage disambiguates the values that hash into the same
bucket, by re-keying a hash function with the value stored in the
corresponding cell of the intermediate table. Determining the
parameters in the intermediate table is what takes time. That most
perfect hash functions look this way is also a good part of the reason
why I doubt it's worthwhile to go there over a simple linear-probing
hashtable with a good hash function - computing two hash values will
usually be worse than linear probing for *small* and *not modified*
hashtables.
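
To make the shape of that concrete, the lookup side of such a scheme
typically ends up like this (a minimal sketch; the hash function, the
array names and the way the displacement re-seeds the second stage are
assumptions, the generated tables come from an offline tool):

#include <stdint.h>
#include <string.h>

#define NBUCKETS 64				/* intermediate table, sized ~ key count */
#define NKEYWORDS 440			/* illustrative */

extern const uint32_t displace[NBUCKETS];	/* generated offline */
extern const char *const keywords[NKEYWORDS];

/* FNV-1a, with the seed folded in so stage two can be re-keyed */
static uint32_t
hash_seeded(const char *key, uint32_t seed)
{
	uint32_t	h = seed ^ 2166136261u;

	while (*key)
	{
		h ^= (unsigned char) *key++;
		h *= 16777619u;
	}
	return h;
}

static int
keyword_lookup(const char *str)
{
	uint32_t	d = displace[hash_seeded(str, 0) % NBUCKETS];
	uint32_t	idx = hash_seeded(str, d) % NKEYWORDS;

	/* the final strcmp is still needed to reject non-keywords */
	return strcmp(keywords[idx], str) == 0 ? (int) idx : -1;
}

Note the two full hash computations per lookup - that's the cost I'd
compare against one hash computation plus, usually, very few probe
steps in a linear-probing table.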

A simple (i.e. slow for large numbers of keys) implementation for
generating a perfect hash function isn't particularly hard. E.g. look
at the Python implementation at http://iswsa.acm.org/mphf/index.html,
and see http://stevehanov.ca/blog/index.php?id=119 for an easy
explanation with graphics.
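
Transcribed to C, the core of the approach described there is just the
following (a sketch, assuming the slot table is exactly as large as
the key set; a real generator would process buckets largest-first and
handle the case where no displacement is found within the bound):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NKEYS 6
#define NBUCKETS NKEYS

static const char *keys[NKEYS] = {
	"select", "insert", "update", "delete", "from", "where"
};

static uint32_t
hash_seeded(const char *key, uint32_t seed)
{
	uint32_t	h = seed ^ 2166136261u;

	while (*key)
	{
		h ^= (unsigned char) *key++;
		h *= 16777619u;
	}
	return h;
}

int
main(void)
{
	int			bucket_keys[NBUCKETS][NKEYS];
	int			bucket_len[NBUCKETS] = {0};
	bool		taken[NKEYS] = {false};
	uint32_t	displace[NBUCKETS] = {0};

	/* stage 1: distribute the keys into buckets */
	for (int k = 0; k < NKEYS; k++)
	{
		uint32_t	b = hash_seeded(keys[k], 0) % NBUCKETS;

		bucket_keys[b][bucket_len[b]++] = k;
	}

	/* stage 2: per bucket, try displacements until every key in the
	 * bucket lands in a free slot */
	for (int b = 0; b < NBUCKETS; b++)
	{
		for (uint32_t d = 1; d < 100000; d++)
		{
			int			claimed[NKEYS];
			int			n = 0;
			bool		ok = true;

			for (int i = 0; ok && i < bucket_len[b]; i++)
			{
				int			slot = hash_seeded(keys[bucket_keys[b][i]], d) % NKEYS;

				if (taken[slot])
					ok = false;
				for (int j = 0; ok && j < n; j++)
					if (claimed[j] == slot)
						ok = false;
				if (ok)
					claimed[n++] = slot;
			}
			if (ok)
			{
				displace[b] = d;
				for (int i = 0; i < n; i++)
					taken[claimed[i]] = true;
				break;
			}
		}
	}

	for (int b = 0; b < NBUCKETS; b++)
		printf("displace[%d] = %u\n", b, displace[b]);
	return 0;
}

The generated displace[] array, together with a lookup like the one
sketched above, is the whole perfect hash function.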

Greetings,

Andres Freund
