Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Joerg Sonnenberger <joerg(at)bec(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2019-01-06 14:11:55
Message-ID: CAKJS1f_-RNZmKPfpcLiFc810wB5NSyD_yACUBU1hFcUQe19zKg@mail.gmail.com
Lists: pgsql-hackers

On Sat, 5 Jan 2019 at 09:20, John Naylor <jcnaylor(at)gmail(dot)com> wrote:
>
> On 1/3/19, Joerg Sonnenberger <joerg(at)bec(dot)de> wrote:
> > Hello John,
> > I was pointed at your patch on IRC and decided to look into adding my
> > own pieces. What I can provide you is a fast perfect hash function
> > generator. I've attached a sample hash function based on the current
> > main keyword list. hash() essentially gives you the number of the only
> > possible match, a final strcmp/memcmp is still necessary to verify that
> > it is an actual keyword though. The |0x20 can be dropped if all cases
> > have pre-lower-cased the input already. This would replace the binary
> > search in the lookup functions. Returning offsets directly would be easy
> > as well. That allows writing a single string where each entry is prefixed
> > with a type mask, the token id, the length of the keyword and the actual
> > keyword text. Does that sound useful to you?
>
> Judging by previous responses, there is still interest in using
> perfect hash functions, so thanks for this. I'm not knowledgeable
> enough to judge its implementation, so I'll leave that for others.

Well, I'm quite impressed by the hash function. The resulting hash
value can be used directly to index the existing 440-element
ScanKeywords[] array (way better than the 1815-element array Andrew
got from gperf). If we also stored the length of each keyword in that
array, we could compare lengths right after hashing: if they differ,
the word is not a keyword; if they match, a memcmp() confirms it,
which should be a little cheaper than a strcmp(). We should also be
able to store the length for free on 64-bit machines, since it fits in
the struct's existing alignment padding.

It may also save some cycles to determine the input word's length at
the same time as lowercasing it.
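Something like this fragment inside the lookup function (just a
sketch; the NAMEDATALEN-sized buffer and the "text" parameter name are
borrowed from the existing ScanKeywordLookup(), the rest is
illustrative):

	char		lower[NAMEDATALEN];
	int			len = 0;

	/* Downcase the word and measure its length in a single pass. */
	while (text[len] != '\0')
	{
		char		ch = text[len];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		if (len >= NAMEDATALEN - 1)
			return NULL;		/* longer than any keyword */
		lower[len++] = ch;
	}
	lower[len] = '\0';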

The keyword length could also be easily determined by changing the
PG_KEYWORD macro to become:

#define PG_KEYWORD(a,b,c) {a,b,c,sizeof(a)-1},

after, of course, adding a new length field to the ScanKeyword struct.
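That is, something like this (field names invented for illustration;
on 64-bit builds the extra int16 should hide in the struct's existing
padding, which is where the "for free" above comes from):

typedef struct ScanKeyword
{
	const char *name;			/* keyword, in lower case */
	int16		value;			/* grammar's token code */
	int16		category;		/* keyword category */
	int16		len;			/* strlen(name), computed at compile time */
} ScanKeyword;

#define PG_KEYWORD(a,b,c) {a, b, c, sizeof(a) - 1},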

What I'm most interested in is: how long did it take to generate the
hash function in hash2.c?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
