Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Joerg Sonnenberger <joerg(at)bec(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date: 2019-01-06 03:48:51
Message-ID: 25312.1546746531@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

John Naylor <jcnaylor(at)gmail(dot)com> writes:
> [ v6-0001-Use-offset-based-keyword-lookup.patch ]

I spent some time hacking on this today, and I think it's committable
now, but I'm putting it back up in case anyone wants to have another
look (and also so the cfbot can check it on Windows).

Given the discussion about possibly switching to perfect hashing,
I thought it'd be a good idea to try to make the APIs less dependent
on the exact table representation. So in the attached, I created
a struct ScanKeywordList that holds all the data ScanKeywordLookup
needs, and the generated headers declare variables of that type,
and we just pass around a pointer to that instead of passing several
different things.

I also went ahead with the idea of splitting the category and token
data into separate arrays. That allows moving the backend token
array out of src/common entirely, which I think is a good thing
because of the dependency situation: we no longer need to run the
bison build before we can compile src/common/keywords_srv.o.

There's one remaining refactoring issue that I think we'd want to consider
before trying to jack this up and wheel a perfect-hash lookup under it:
where to do the downcasing transform. Right now, ecpg's c_keywords.c
has its own copy of the binary-search logic because it doesn't want the
downcasing transform that ScanKeywordLookup does. So unless we want it
to also have a copy of the hash lookup logic, we need to rearrange that
somehow. We could give ScanKeywordLookup a "bool downcase" argument,
or we could refactor things so that the downcasing is done by callers
if they need it (which many don't). I'm not very sure which of those
three alternatives is best.

My argument upthread that we could always do the downcasing before
keyword lookup now feels a bit shaky, because I was reminded while
working on this code that we actually have different downcasing rules
for keywords and identifiers (yes, really), so that it's not possible
for those code paths to share a downcasing transform. So the idea of
moving the keyword-downcasing logic to the callers is likely to not
work out quite as nicely as I thought. (This might also mean that
I was overly hasty to reject Joerg's |0x20 hack. It's still an
ugly hack, but it would save doing the keyword downcasing transform
if we don't get a hashcode match...)

regards, tom lane

Attachment Content-Type Size
offset-based-keyword-lookup-7.patch text/x-diff 72.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-01-06 05:04:29 Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Previous Message David Rowley 2019-01-06 03:24:24 Re: Ordered Partitioned Table Scans