Re: scanner/parser minimization

From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: scanner/parser minimization
Date: 2013-03-02 15:05:59
Message-ID: CAM-w4HOECQzDCUUOjBfTtpW-LG+S-U2aOtYNy0PsO6eKqNAx_Q@mail.gmail.com
Lists: pgsql-hackers

Regarding yytransition, I think the problem is that we're using flex to
implement keyword recognition, which is not what it's usually used for.
Usually people use flex to handle syntax things like quoting and
numeric formats, and flex treats all identifiers as equivalent.
Then the last step in the scanner for identifiers is to look up the
identifier in a hash table and return the keyword token if it's a
keyword. That would massively simplify the scanner tables.

This hash table can be heavily optimized because it's a static lookup.
cf. http://www.gnu.org/software/gperf/

In theory this is more expensive since it needs to do a strcmp in
addition to scanning the identifier to determine where the token
ends. But I suspect in practice the smaller tables might outweigh
that cost.
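
To make the idea concrete, here is a minimal sketch of the lookup step,
assuming a tiny hand-rolled hash table rather than a gperf-generated
perfect hash. The token codes, the keyword list, and the function name
classify_identifier are all made up for illustration and are not taken
from the PostgreSQL sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would get these from bison. */
enum { TOK_IDENT = 1, TOK_SELECT, TOK_FROM, TOK_WHERE, TOK_TABLE };

typedef struct { const char *name; int token; } Keyword;

/* Illustrative keyword list, stored in lower case. */
static const Keyword keywords[] = {
    {"select", TOK_SELECT}, {"from", TOK_FROM},
    {"where", TOK_WHERE}, {"table", TOK_TABLE},
};
#define NKEYWORDS (sizeof(keywords) / sizeof(keywords[0]))
#define NSLOTS 16               /* power of two, larger than NKEYWORDS */

static const Keyword *slots[NSLOTS];
static int slots_ready = 0;

/* FNV-1a hash over the lower-cased bytes of s. */
static unsigned hash_ci(const char *s)
{
    unsigned h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned) tolower((unsigned char) *s);
        h *= 16777619u;
    }
    return h;
}

/* Case-insensitive equality; this is the "strcmp" cost mentioned above. */
static int equal_ci(const char *a, const char *b)
{
    for (; *a && *b; a++, b++)
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
    return *a == *b;
}

/* Fill the table once, using linear probing on collisions. */
static void build_slots(void)
{
    for (size_t i = 0; i < NKEYWORDS; i++) {
        unsigned j = hash_ci(keywords[i].name) & (NSLOTS - 1);
        while (slots[j] != NULL)
            j = (j + 1) & (NSLOTS - 1);
        slots[j] = &keywords[i];
    }
    slots_ready = 1;
}

/* Called after the scanner has matched a generic identifier. */
int classify_identifier(const char *ident)
{
    assert(ident != NULL);
    if (!slots_ready)
        build_slots();
    unsigned j = hash_ci(ident) & (NSLOTS - 1);
    while (slots[j] != NULL) {
        if (equal_ci(slots[j]->name, ident))
            return slots[j]->token;
        j = (j + 1) & (NSLOTS - 1);
    }
    return TOK_IDENT;
}
```

The flex rule for identifiers would then just return
classify_identifier(yytext). A gperf-generated table would avoid both
the startup fill and the probing loop, since the hash is collision-free
by construction.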

On Thu, Feb 28, 2013 at 9:09 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I believe however that it's possible to extract an idea of which
> tokens the parser believes it can see next at any given parse state.
> (I've seen code for this somewhere on the net, but am too lazy to go
> searching for it again right now.) So we could imagine a rule along
> the lines of "if IDENT is allowed as a next token, and $KEYWORD is
> not, then return IDENT not the keyword's own token".

That's a pretty cool idea. I'm afraid it might be kind of slow to
produce that list and load it into a hash table for every ident token
though. I suppose if you can request it only if the ident is a keyword
of some type then it wouldn't actually kick in often.
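
A rough sketch of the rule Tom describes, assuming the set of tokens
the parser can accept next is somehow available as a bitmask. How to
get that set out of the parser is exactly the open question, so
TokenSet, TOKBIT and resolve_token are hypothetical names, not an
existing bison or PostgreSQL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token codes, small enough to index bits of a TokenSet. */
enum { TOK_IDENT = 0, TOK_SELECT, TOK_FROM, TOK_VALUE, TOK_COUNT };

/* Assumed representation: one bit per token acceptable in this state. */
typedef uint32_t TokenSet;
#define TOKBIT(t) ((TokenSet) 1u << (t))

static bool set_has(TokenSet s, int tok)
{
    return (s & TOKBIT(tok)) != 0;
}

/*
 * keyword_token is what the keyword lookup returned for the scanned
 * word. If the parser could take IDENT here but not this keyword,
 * hand back IDENT instead of the keyword's own token.
 */
int resolve_token(int keyword_token, TokenSet allowed)
{
    assert(keyword_token >= 0 && keyword_token < TOK_COUNT);
    if (keyword_token == TOK_IDENT)
        return TOK_IDENT;       /* not a keyword, nothing to decide */
    if (set_has(allowed, TOK_IDENT) && !set_has(allowed, keyword_token))
        return TOK_IDENT;
    return keyword_token;
}
```

A caller would only need to compute the allowed set when the keyword
lookup actually hits, which is the "wouldn't kick in often" case above.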

That would also imply we could simply use IDENT everywhere we
currently have col_name_keyword or type_function_name. Merely having
a rule that accepts a keyword in some location would implicitly stop
that keyword from being accepted as an IDENT there when unquoted. That
might make the documentation a bit trickier and it would make it harder
for users to make their code forward-compatible. It would also make it
harder for hackers to notice when we've accidentally narrowed the
allowable identifiers in more cases than expected.

--
greg
