Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Joerg Sonnenberger <joerg(at)bec(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joerg Sonnenberger <joerg(at)bec(dot)de>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, John Naylor <jcnaylor(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2019-01-06 20:11:57
Message-ID: 20190106201157.GA546@britannica.bec.de
Lists: pgsql-hackers

On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote:
> * It's too bad that the hash function doesn't have a return convention
> that allows distinguishing "couldn't possibly match any keyword" from
> "might match keyword 0". I imagine a lot of the zero entries in its
> hashtable could be interpreted as the former, so at the cost of perhaps
> a couple more if-tests we could skip work at the caller. As this patch
> stands, we could only skip the strcmp() so it might not be worth the
> trouble --- but if we use Joerg's |0x20 hack then we could hash before
> downcasing, allowing us to skip the downcasing step if the word couldn't
> possibly be a keyword. Likely that would be worth the trouble.

The hash function itself doesn't carry enough data to know whether
it's a match or not. A strcmp or memcmp at the end will always be
necessary unless you already know the input is a keyword.
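
To make the point concrete, here is a minimal sketch of that final
confirmation step (all names and tables are hypothetical stand-ins, not
the patch's actual API): the hash only nominates one candidate slot,
and a length check plus memcmp against the stored spelling is what
decides.

#include <string.h>

/* Hypothetical generated data: keyword spellings in one blob plus a
 * per-slot offset into it.  Two fake keywords stand in for the real
 * generated tables. */
static const char kw_strings[] = "abort\0select";
static const unsigned short kw_offsets[] = {0, 6};

/* Stand-in for the generated hash step: it nominates a slot, nothing
 * more. */
static int
kw_hash(const char *word, size_t len)
{
	return (len & 1) ? 0 : 1;
}

/* The hash cannot certify a match by itself; the memcmp does. */
static int
kw_lookup(const char *word, size_t len)
{
	const char *kw = kw_strings + kw_offsets[kw_hash(word, len)];

	if (strlen(kw) == len && memcmp(kw, word, len) == 0)
		return 1;			/* confirmed keyword */
	return 0;				/* hash slot held something else */
}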

> * We should extend the ScanKeywordList representation to include a
> field holding the longest keyword length in the table, which
> gen_keywordlist.pl would have no trouble providing. Then we could
> skip downcasing and/or hashing for any word longer than that, replacing
> the current NAMEDATALEN test, and thereby putting a tight bound on
> the cost of downcasing and/or hashing.

Correct; possibly even an array with one maximum length per class of
keywords.
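
As a sketch, with max_kw_len as an assumed field that
gen_keywordlist.pl would emit (not the actual struct layout):

#include <string.h>

typedef struct ScanKeywordList
{
	int		max_kw_len;		/* assumed: longest keyword in the table */
	/* ... the existing generated fields ... */
} ScanKeywordList;

/* Bail out before any downcasing or hashing: a word longer than the
 * longest keyword cannot match.  This replaces the NAMEDATALEN test
 * and tightly bounds the per-word cost. */
static int
kw_length_plausible(const char *word, const ScanKeywordList *keywords)
{
	return strlen(word) <= (size_t) keywords->max_kw_len;
}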

> * If we do hash first, then we could replace the downcasing loop and
> strcmp with an integrated loop that downcases and compares one
> character at a time, removing the need for the NAMEDATALEN-sized
> buffer variable.

This is also an option. Assuming that the character set for keywords
doesn't change (letters or '_'), one 64-bit bit test per input
character would ensure that the |0x20 hack gives a correct result for
the comparison as well. Any other input would be an instant mismatch.
If digits were valid keyword characters, it would take two tests.
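
One possible shape of that loop, purely as a sketch (the mask layout
and comparing both sides through |0x20 are my assumptions): a 64-bit
bitmap over the code points 0x40..0x7F classifies the downcased byte,
and anything outside the keyword alphabet fails immediately.

#include <stdint.h>
#include <stddef.h>

/* Bits set for 'a'..'z' and for '_' | 0x20 (= 0x7F), indexed from
 * 0x40.  Digits end up below 0x40 after |0x20, hence the second test
 * they would require. */
#define KW_CHAR_MASK \
	((((UINT64_C(1) << 26) - 1) << ('a' - 0x40)) | (UINT64_C(1) << (0x7F - 0x40)))

/* Integrated downcase-and-compare: no NAMEDATALEN-sized buffer.  The
 * stored keyword is plain lowercase-plus-underscore; both sides go
 * through |0x20, which is injective on the allowed character set. */
static int
kw_equal(const char *word, size_t len, const char *kw)
{
	for (size_t i = 0; i < len; i++)
	{
		unsigned char d = (unsigned char) word[i] | 0x20;

		/* one 64-bit bit test per input character */
		if (d < 0x40 || d > 0x7F || !((KW_CHAR_MASK >> (d - 0x40)) & 1))
			return 0;		/* instant mismatch: not a keyword character */
		if (d != ((unsigned char) kw[i] | 0x20))
			return 0;
	}
	return kw[len] == '\0';	/* lengths must match too */
}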

> * I think it matters to the speed of the hashing loop that the
> magic multipliers are hard-coded. (Examining the assembly code
> shows that, at least on my x86_64 hardware, gcc prefers to use
> shift-and-add sequences here instead of multiply instructions.)

Right, that's one of the reasons for choosing them. The other is that
they give decent mixing for ASCII-only input. At the moment, they are
hard-coded.
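
For illustration, a sketch of why the constants matter (the multiplier
and seed values below are placeholders, not the generated ones): with
the multipliers visible as literals, the compiler can lower each
multiply into a shift-and-add sequence.

#include <stdint.h>
#include <stddef.h>

/* Two accumulators with hard-coded multipliers; a perfect-hash scheme
 * would combine them through a generated table afterwards. */
static uint32_t
kw_hash_pair(const char *key, size_t len, uint32_t *b_out)
{
	uint32_t	a = 0x2A;	/* randomized initial values, ... */
	uint32_t	b = 0x17;	/* ... baked in by the generator */

	while (len--)
	{
		unsigned char c = (unsigned char) *key++ | 0x20;

		a = a * 257 + c;	/* 257 lowers to (a << 8) + a, no imul */
		b = b * 31 + c;		/* 31 lowers to (b << 5) - b */
	}
	*b_out = b;
	return a;
}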

> So we probably can't have inlined hashing code --- I imagine the
> hash generator needs the flexibility to pick different values of
> those multipliers.

Right now, only the initial values are randomized. Picking a different
set of hash functions is possible, but that is something that should be
done only if there is an actual need. That is what I meant when I said
stronger mixing might be necessary for "annoying" keyword additions.

> I envision making this work by having
> gen_keywordlist.pl emit a function definition for the hash step and
> include a function pointer to it in ScanKeywordList. That extra
> function call will make things fractionally slower than what I have
> here, but I don't think it will change the conclusions any. This
> approach would also greatly alleviate the concern I had yesterday
> about ecpg's c_keywords.c having a second copy of the hashing code;
> what it would have is its own generated function, which isn't much
> of a problem.

There are two ways of dealing with it:
(1) Have one big hash table with all the various keywords and a class
mask stored alongside each entry. If there is enough overlap between
the keyword tables, this can significantly reduce the amount of space
needed. In terms of code complexity, it adds one class check at the
end, i.e. a bitmap test (see the sketch below).
(2) Build independent hash tables for each input class. A bit simpler
to manage, but it can result in a bit of wasted space.

From the generator side, it doesn't matter which choice is taken.
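
A sketch of what option (1) could look like on the C side (the class
bits, struct, and sample entries are illustrative only):

/* Illustrative class bits: which scanner(s) a keyword belongs to. */
#define KWCLASS_CORE	(1 << 0)	/* core grammar keywords */
#define KWCLASS_ECPG_C	(1 << 1)	/* ecpg's c_keywords */

typedef struct SharedKeyword
{
	const char *name;
	unsigned char classes;		/* bitmap of keyword classes */
} SharedKeyword;

/* Overlapping keywords are stored only once, with both bits set. */
static const SharedKeyword shared_kw[] = {
	{"select", KWCLASS_CORE},
	{"typedef", KWCLASS_ECPG_C},
	{"year", KWCLASS_CORE | KWCLASS_ECPG_C},
};

/* After the usual hash + string confirmation, classifying the hit is
 * the single extra bitmap test mentioned above. */
static int
kw_in_class(const SharedKeyword *kw, unsigned char wanted_class)
{
	return (kw->classes & wanted_class) != 0;
}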

> * Given that the generator's runtime is negligible when coded in C,
> I suspect that we might able to tolerate the speed hit from translating
> it to Perl, and frankly I'd much rather do that than cope with the
> consequences of including C code in our build process.

I'm just not fluent enough in Perl to be of much help there, but I
could sit down and write a trivial Python version of it :) There are a
couple of changes that would be useful to have in this context, e.g.
the ability to provide the offsets directly in the result table,
allowing the index -> offset table to be dropped completely; a sketch
of the difference is below.
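
A sketch of the difference on the generated-output side (array names
and sizes are hypothetical):

#include <stdint.h>

#define KW_HASHSIZE	8		/* hypothetical hash table size */
#define KW_NUMKW	4		/* hypothetical keyword count */

/* Two-step lookup: the hash yields a keyword index, which a second
 * table maps to a string offset. */
static const uint16_t kw_index_of_hash[KW_HASHSIZE];
static const uint16_t kw_offset_of_index[KW_NUMKW];

/* If the generator emits string offsets per hash slot directly, the
 * intermediate index -> offset table disappears: */
static const uint16_t kw_offset_of_hash[KW_HASHSIZE];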

Joerg
