scanner/parser minimization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: scanner/parser minimization
Date: 2013-02-28 20:34:32
Message-ID: CA+TgmoaaYvJ7yDKJHrWN1BVk_7fcV16rvc93udSo59gfxG_t7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Today's b^Hdiscussion on materialized views reminded me that I spent a
little bit of time looking at gram.y and thinking about what we might
be able to do to reduce the amount of bloat it spits out. On my
system, without debugging symbols, gram.o is 1019260 bytes. Using nm
gram.o | sort | less to compare the starting addresses of each symbol
with the next symbol, I figured out the size of some of the larger
symbols: yy_transition is 516288 bytes, and yycheck and yytable are
each 145984 bytes. Thus, anything which doesn't reduce the size of
one of those three symbols isn't going to help very much.

yy_transition appears in scan.c - that is, it's part of the flex
output, not the yacc output. It is an array of 64,535 yy_trans_info
structures, each containing two flex_int32_t members. Off-hand the
values all look small enough to fit in 2-byte integers rather than
4-byte integers; it's not clear to me why flex doesn't do that. It's
possible to vastly reduce the size of the scanner output, and
therefore of gram.c, by running flex with -Cf rather than -CF, which
changes the table representation completely. I assume there is a
sound performance reason why we don't do this, but it might be worth
checking that if we haven't lately. When compiled with -Cf, the size
of gram.o drops from 1019260 bytes to 703600, which is a large
savings.

yytable and yycheck are both part of the parser; that is, they appear
in gram.c. Each is an array of 2-byte integers. I'm not clear on
exactly how these relate to the size of the state table, but it seems
that the parser has the idea that each state has a list of actions
associated with specific next-tokens, and then it also has a "default
action" which is followed for next-tokens for which no specific rule
is present. I believe that yytable and yycheck somehow encode these
transition tables, but the details are not altogether clear to me. If
that view of the situation is correct, then a reasonable way of
approaching the task of reducing the parser size is to look for states
in gram.output (generated by running bison with the -v option) that
have a lot of non-default rules and mulling over how we might reduce
that number.

A whole lot of those state transitions are attributable to states
which have separate transitions for each of many keywords. They
transition to states which then reduce the corresponding keyword to
unreserved_keyword, col_name_keyword, type_func_keyword, or
reserved_keyword. So it would seem that if we could reduce those
transitions in some way, it would help a lot. I experimented with
this a little. Consider the following patch:

--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -12489,8 +12489,6 @@ SignedIconst: Iconst
{ $$ = $1; }
/* Column identifier --- names that can be column, table, etc names.
*/
ColId: IDENT
{ $$ = $1; }
- | unreserved_keyword
{ $$ = pstrdup($1); }
- | col_name_keyword
{ $$ = pstrdup($1); }
;

/* Type/function identifier --- names that can be type or function names.

On my machine, that two-line patch has the effect of reducing the size
of gram.o from 1019260 bytes to 844844 bytes, a savings of 164kB.
Pretty good for ripping out two lines of code. Applying
similarly-crude hacks to type_function_name or ColLabel is not nearly
as effective. If I hack up all three, gram.o drops to 794612 bytes, a
savings of 49 additional kB. If I hack up ONLY one of the other two
and NOT ColId, the savings are much less. Clearly ColId is the big
offender by far. I suspect, but am not positive, that this is simply
because ColId is more widely-used throughout the grammar. For
example, the following patch actually makes gram.o about 4kB larger:

--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5381,7 +5381,7 @@ SecLabelStmt:
}
;

-opt_provider: FOR ColId_or_Sconst { $$ = $2; }
+opt_provider: FOR ColLabel { $$ = $2; }
| /* empty */ { $$ = NULL; }
;

Accepting more rather than fewer keywords in that position makes
things worse, which makes sense.

Interestingly, this analysis suggests that while reserved keywords and
type_func_name keywords are surely worse from an application
compatibility standpoint, unreserved and col_name_keywords are worse
from a parser bloat standpoint, because there are more contexts where
we have to laboriously ignore their status as keywords, via additional
parser state transitions.

I don't have a brilliant idea on what to do about this, but one
thought that did occur to me is that we might be able to use mid-rule
actions to change the scanner behavior. In other words, suppose we
reach a point in the input string where we know that we no longer care
about parsing unreserved keywords as keywords, but are instead happy
to have them returned as IDENT. We could then, at that point in the
rule, do something like { scan_unreserved_keywords = false; }, and the
{identifier} production in scan.l would then behave differently based
on the state of that flag. That would in turn allow gram.y symbols
used after that point in the rule to NOT include unreserved_keywords
alongside IDENT, which would slice out a bunch of state transitions.

There are a couple of problems with this idea. The complexity of
maintaining it is surely one, since we'd need duplicate productions
for certain things depending on the context in which they were
expected to be used. We could perhaps use Assert() against the
scan_unreserved_keywords flag (or its moral equivalent) to catch
coding mistakes. A more serious objection is that there are several
different categories of unreserved keywords and they don't all fit
neatly into this concept. For example, ENCRYPTED and UNENCRYPTED are
only used as keywords within ALTER ROLE (what a waste!), so somehow
shutting off recognition of those keywords elsewhere in the grammar
might work OK. But ZONE can appear in an arbitrary a_expr and must be
recognized there. It manages to be unreserved only because it
invariably follows TIME. If you can't flip the "don't scan unreserved
keywords" switch in any place where an a_expr might appear, it seems
to me that this idea is unlikely to apply in enough situations to
bring much benefit. Also, the mid-rule actions create new parser
states, which eats away at the gains we might otherwise make through
such tricks.

So all in all I'm not quite sure where to go with this, but I thought
that the foregoing analysis might be interesting enough to be worth
posting, in case it inspires someone else with an idea that seems
worth pursuing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-02-28 21:09:11 Re: scanner/parser minimization
Previous Message Tom Lane 2013-02-28 20:20:42 Re: sql_drop Event Trigger