Re: benchmarking Flex practices

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: benchmarking Flex practices
Date: 2019-07-02 22:35:39
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> writes:
> 0001 is a small patch to remove some unneeded generality from the
> current rules. This lowers the number of elements in the yy_transition
> array from 37045 to 36201.

I don't particularly like 0001. The two bits like this

-whitespace ({space}+|{comment})
+whitespace ({space}|{comment})

seem likely to create performance problems for runs of whitespace, in that
the lexer will now have to execute the associated action once per space
character not just once for the whole run. Those actions are empty, but
I don't think flex optimizes for that, and it's really flex's per-action
overhead that I'm worried about. Note the comment in the "Performance"
section of the flex manual:

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer
the tokens matched, the faster the scanner will run. This is because
with long tokens the processing of most input characters takes place
in the (short) inner scanning loop, and does not often have to go
through the additional work of setting up the scanning environment
(e.g., `yytext') for the action.

There are a bunch of higher-order productions that use "{whitespace}*",
which is surely a bit redundant given the contents of {whitespace}.
But maybe we could address that by replacing "{whitespace}*" with
"{opt_whitespace}" defined as

opt_whitespace ({space}*|{comment})

Not sure what impact if any that'd have on table size, but I'm quite sure
that {whitespace} was defined with an eye to avoiding unnecessary
lexer action cycles.

As for the other two bits that are like

-<xe>. {
- /* This is only needed for \ just before EOF */
+<xe>\\ {

my recollection is that those productions are defined that way to avoid a
flex warning about not all possible input characters being accounted for
in the <xe> (resp. <xdolq>) state. Maybe that warning is
flex-version-dependent, or maybe this was just a worry and not something
that actually produced a warning ... but I'm hesitant to change it.
If we ever did get to flex's default action, that action is to echo the
current input character to stdout, which would be Very Bad.

As far as I can see, the point of 0002 is to have just one set of
flex rules for the various variants of quotecontinue processing.
That sounds OK, though I'm a bit surprised it makes this much difference
in the table size. I would suggest that "state_before" needs a less
generic name (maybe "state_before_xqs"?) and more than no comment.
Possibly more to the point, it's not okay to have static state variables
in the core scanner, so that variable needs to be kept in yyextra.
(Don't remember offhand whether it's any more acceptable in the other

regards, tom lane

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2019-07-02 22:46:22 Re: Fix two issues after moving to unified logging system for command-line utils
Previous Message Tom Lane 2019-07-02 20:56:01 Re: "long" type is not appropriate for counting tuples