Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Joerg Sonnenberger <joerg(at)bec(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: John Naylor <jcnaylor(at)gmail(dot)com>, Joerg Sonnenberger <joerg(at)bec(dot)de>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2019-01-04 21:10:03
Message-ID: 20190104211003.GA18302@britannica.bec.de
Lists: pgsql-hackers

On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote:
> John Naylor <jcnaylor(at)gmail(dot)com> writes:
> > On 1/3/19, Joerg Sonnenberger <joerg(at)bec(dot)de> wrote:
> >> I was pointed at your patch on IRC and decided to look into adding my
> >> own pieces. What I can provide you is a fast perfect hash function
> >> generator. I've attached a sample hash function based on the current
> >> main keyword list. hash() essentially gives you the number of the only
> >> possible match, a final strcmp/memcmp is still necessary to verify that
> >> it is an actual keyword though. The |0x20 can be dropped if all cases
> >> have pre-lower-cased the input already. This would replace the binary
> >> search in the lookup functions. Returning offsets directly would be easy
> >> as well. That allows writing a single string where each entry is prefixed
> >> with a type mask, the token id, the length of the keyword and the actual
> >> keyword text. Does that sound useful to you?
>
> > Judging by previous responses, there is still interest in using
> > perfect hash functions, so thanks for this. I'm not knowledgeable
> > enough to judge its implementation, so I'll leave that for others.
>
> We haven't actually seen the implementation, so it's hard to judge ;-).

It's a temporary hacked version of nbperf in the NetBSD tree.

> The sample hash function certainly looks great. I'm not terribly on board
> with using |0x20 as a substitute for lower-casing, but that's a minor
> detail.

Yeah, I included that part mainly because I don't know the current use
cases well enough. If all callers already lower-case the input in
advance, it is trivial to drop.
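For illustration, here is a toy sketch of the lookup scheme described
above (the keyword set, slot layout, and function names are invented
for the sketch; the real generated function covers the full keyword
list and a proper mixing function):

```c
#include <string.h>

/* Hypothetical generated lookup: hash() names the only slot that could
 * match, and a final strcmp verifies it is an actual keyword.  The
 * |0x20 folds ASCII upper case to lower case, so only the final
 * compare needs pre-lower-cased input. */
static const char *const slots[8] = {
    "select", "or", NULL, NULL, NULL, "not", NULL, "and"
};

static unsigned hash(const char *key)
{
    /* the first two bytes happen to separate this toy keyword set */
    return ((key[0] | 0x20) + (key[1] | 0x20)) & 7;
}

/* Returns the slot number of the keyword, or -1 for a non-keyword. */
int keyword_lookup(const char *word)
{
    if (word[0] == '\0')
        return -1;              /* hash() reads two bytes */
    unsigned idx = hash(word);
    return slots[idx] != NULL && strcmp(slots[idx], word) == 0
        ? (int) idx : -1;
}
```

A caller that has already lower-cased its input could drop the |0x20
and keep only the final strcmp.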

> The foremost questions in my mind are:
>
> * What's the generator written in? (if the answer's not "Perl", wedging
> it into our build processes might be painful)

Plain C, nothing really fancy in it.

> * What license is it under?

Two-clause BSD license.

> * Does it always succeed in producing a single-level lookup table?

This question is a bit tricky. The short answer: yes. The longer
answer: the hash function chosen for the example is very simple (just
two variations of a DJB-style hash), so with that: no, not without
potentially fiddling a bit with the hash function if things ever get
nasty, e.g. two keywords hitting a funnel for both variants. The main
concern in that choice was speed. When using two families of
independent hash functions, the generator runs in probabilistic linear
time in the number of keys. That means that with a strong enough hash
function, like the Jenkins hash PG uses elsewhere, it will succeed very
fast. So if it ever fails on new keywords, making the mixing a bit
stronger should be enough.
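To make the probabilistic-linear-time point concrete, here is a toy
sketch of the CHM-style construction (this is not the nbperf code; the
keyword set, table sizes, and hash constants are invented for the
sketch): each key becomes an edge between its two hash values, any seed
whose graph contains a cycle is rejected, and on an acyclic graph a
displacement table g[] can always be filled so that
(g[h1(key)] + g[h2(key)]) mod n recovers the key's index.

```c
#include <stdint.h>
#include <string.h>

#define NKEYS 4                 /* toy keyword set, stand-in for the real list */
#define NVERT 13                /* > 2*NKEYS vertices keeps the edge graph sparse */

static const char *const keys[NKEYS] = { "and", "or", "not", "select" };

static uint32_t g[NVERT];       /* the generated displacement table */
static uint32_t chosen_seed;

/* Two variations of a DJB-style hash, as in the sample generator.
 * 64-bit state avoids overflow artifacts for these short keys. */
static uint64_t djb(const char *s, uint64_t h, uint64_t mult)
{
    for (; *s; s++)
        h = h * mult + (uint8_t) *s;
    return h;
}

static uint32_t h1(const char *s, uint32_t seed)
{
    return (uint32_t) (djb(s, seed, 33) % NVERT);
}

static uint32_t h2(const char *s, uint32_t seed)
{
    return (uint32_t) (djb(s, seed ^ 0x9e3779b9u, 31) % NVERT);
}

/* Union-find, used only to detect cycles while adding edges. */
static int uf[NVERT];
static int uf_find(int x) { return uf[x] == x ? x : (uf[x] = uf_find(uf[x])); }

static int eu[NKEYS], ev[NKEYS];
static int visited[NVERT];

/* Walk a tree component, fixing g[] so g[eu[i]] + g[ev[i]] == i (mod NKEYS). */
static void assign(int vtx)
{
    visited[vtx] = 1;
    for (int i = 0; i < NKEYS; i++) {
        int other = eu[i] == vtx ? ev[i] : ev[i] == vtx ? eu[i] : -1;
        if (other >= 0 && !visited[other]) {
            g[other] = ((uint32_t) i + NKEYS - g[vtx]) % NKEYS;
            assign(other);
        }
    }
}

static int try_seed(uint32_t seed)
{
    for (int v = 0; v < NVERT; v++)
        uf[v] = v;
    for (int i = 0; i < NKEYS; i++) {
        eu[i] = (int) h1(keys[i], seed);
        ev[i] = (int) h2(keys[i], seed);
        int ru = uf_find(eu[i]), rv = uf_find(ev[i]);
        if (ru == rv)
            return 0;           /* cycle or self-loop: reject this seed */
        uf[ru] = rv;
    }
    memset(visited, 0, sizeof(visited));
    memset(g, 0, sizeof(g));
    for (int v = 0; v < NVERT; v++)
        if (!visited[v])
            assign(v);          /* graph is acyclic, so this always works */
    return 1;
}

/* Probabilistic generation: retry seeds until the edge graph is acyclic. */
static int generate(void)
{
    for (uint32_t seed = 1; seed < 1000; seed++)
        if (try_seed(seed)) {
            chosen_seed = seed;
            return 1;
        }
    return 0;
}

/* The emitted lookup: candidate index only; callers still strcmp to verify. */
static uint32_t phash(const char *s)
{
    return (g[h1(s, chosen_seed)] + g[h2(s, chosen_seed)]) % NKEYS;
}
```

With a sparse graph each attempt succeeds with constant probability, so
the expected number of retries is O(1) and total generation work stays
linear in the number of keys, which is the behavior described above.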

Joerg
