Re: Speeding up the Postgres lexer

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speeding up the Postgres lexer
Date: 2005-05-23 16:42:10
Message-ID: 200505231642.j4NGgA315728@candle.pha.pa.us
Lists: pgsql-hackers


This seems fine. I don't think the lexer changes often enough for us to
have issues with new cases. I think adding some comments explaining why
we are doing it is enough, and perhaps a test case that can be re-run
later.

---------------------------------------------------------------------------

Tom Lane wrote:
> I was doing some profiles recently that showed that on simple statements
> (like INSERT with a few column values) the basic parser (flex/bison) was
> taking up a noticeable percentage of the total CPU time. We already
> have done some things to speed up the lexer, like using the -CF
> option for large-but-fast tables. I read in the flex manual about how
> you can make
> it faster yet if you can avoid the need for backup states. It's not
> terribly well documented, but I eventually worked out how to do it, and
> find that it seems to make the lexer about a third faster. Basically
> what you have to do is provide alternate rules that allow the lexer to
> match *some* rule on any prefix of a complete token. For instance the
> rule for {real} implies that given
> 123.456E+
> if the next character is not a digit, then the {real} rule fails, and
> the lexer has to backtrack to
> 123.456
> which it can recognize as a {decimal}. You can add a rule that does
> match in this situation, throws back the two unwanted characters with
> yyless(), and returns the number as the {decimal} rule would do.
> The next call will carry on lexing the "E" as an identifier, etc.
> (Note: this is going to end up being a syntax error because the SQL
> grammar never allows a number to immediately precede an identifier;
> but that's not the lexer's concern, and so I don't think that we should
> make the extra rule throw an error immediately.)
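[Editor's note: the fallback-rule idiom described above can be sketched as a
standalone flex fragment. The rule names mirror those in the patch below;
FCONST/ICONST and the surrounding scanner scaffolding are assumed to be
defined elsewhere, as in scan.l. This is an illustrative sketch, not the
patch itself.]

```lex
digit       [0-9]
integer     {digit}+
decimal     (({digit}*\.{digit}+)|({digit}+\.{digit}*))
real        ({integer}|{decimal})[Ee][-+]?{digit}+
/* Fallback patterns: match the prefixes that would otherwise force
 * the scanner to back up when the exponent turns out to be incomplete. */
realfail1   ({integer}|{decimal})[Ee]
realfail2   ({integer}|{decimal})[Ee][-+]

%%

{real}      { return FCONST; }
{realfail1} {
                /* throw back the dangling "E"; it will be lexed as an
                 * identifier on the next call */
                yyless(yyleng - 1);
                return FCONST;
            }
{realfail2} {
                /* throw back the trailing "E+" or "E-" */
                yyless(yyleng - 2);
                return FCONST;
            }
{decimal}   { return FCONST; }
{integer}   { return ICONST; }
```

With these rules, input like `123.456E+` matches {realfail2} as the longest
match; yyless() trims yytext back to `123.456`, which is returned as FCONST,
and scanning resumes at the `E`.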
>
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires. But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.
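[Editor's note: while flex will not fail outright on backup states, its -b
(--backup) option does write a lex.backup report describing them, and the
file contains just "No backing up." when the scanner is backup-free. A build
rule could grep for that; a sketch, assuming flex is installed and the usual
file names:]

```shell
# Generate the scanner and a backup-state report (lex.backup)
flex -b -CF -o scan.c scan.l
# Fail the build if any backup states crept back in
grep -q "No backing up" lex.backup || \
    { echo "scan.l has backup states" >&2; exit 1; }
```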
>
> A prototype patch (completely undocumented :-() is attached. Any
> thoughts about whether it's worth pressing forward with?
>
> regards, tom lane
>
> *** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005
> --- src/backend/parser/scan.l Mon May 23 01:05:54 2005
> ***************
> *** 148,163 ****
> * validate the contents.
> */
> xbstart [bB]{quote}
> ! xbstop {quote}
> xbinside [^']*
> xbcat {quote}{whitespace_with_newline}{quote}
>
> /* Hexadecimal number
> */
> xhstart [xX]{quote}
> ! xhstop {quote}
> xhinside [^']*
> xhcat {quote}{whitespace_with_newline}{quote}
>
> /* National character
> */
> --- 148,165 ----
> * validate the contents.
> */
> xbstart [bB]{quote}
> ! xbstop {quote}{whitespace}*
> xbinside [^']*
> xbcat {quote}{whitespace_with_newline}{quote}
> + xbconfused {quote}{whitespace}*"-"
>
> /* Hexadecimal number
> */
> xhstart [xX]{quote}
> ! xhstop {quote}{whitespace}*
> xhinside [^']*
> xhcat {quote}{whitespace_with_newline}{quote}
> + xhconfused {quote}{whitespace}*"-"
>
> /* National character
> */
> ***************
> *** 169,180 ****
> */
> quote '
> xqstart {quote}
> ! xqstop {quote}
> xqdouble {quote}{quote}
> xqinside [^\\']+
> xqescape [\\][^0-7]
> xqoctesc [\\][0-7]{1,3}
> xqcat {quote}{whitespace_with_newline}{quote}
>
> /* $foo$ style quotes ("dollar quoting")
> * The quoted string starts with $foo$ where "foo" is an optional string
> --- 171,183 ----
> */
> quote '
> xqstart {quote}
> ! xqstop {quote}{whitespace}*
> xqdouble {quote}{quote}
> xqinside [^\\']+
> xqescape [\\][^0-7]
> xqoctesc [\\][0-7]{1,3}
> xqcat {quote}{whitespace_with_newline}{quote}
> + xqconfused {quote}{whitespace}*"-"
>
> /* $foo$ style quotes ("dollar quoting")
> * The quoted string starts with $foo$ where "foo" is an optional string
> ***************
> *** 185,190 ****
> --- 188,194 ----
> dolq_start [A-Za-z\200-\377_]
> dolq_cont [A-Za-z\200-\377_0-9]
> dolqdelim \$({dolq_start}{dolq_cont}*)?\$
> + dolqfailed \${dolq_start}{dolq_cont}*
> dolqinside [^$]+
>
> /* Double quote
> ***************
> *** 247,253 ****
>
> integer {digit}+
> decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
>
> param \${integer}
>
> --- 251,259 ----
>
> integer {digit}+
> decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real ({integer}|{decimal})[Ee][-+]?{digit}+
> ! realfail1 ({integer}|{decimal})[Ee]
> ! realfail2 ({integer}|{decimal})[Ee][-+]
>
> param \${integer}
>
> ***************
> *** 310,315 ****
> --- 316,325 ----
> /* ignore */
> }
>
> + <xc>\*+ {
> + /* ignore */
> + }
> +
> <xc><<EOF>> { yyerror("unterminated /* comment"); }
>
> {xbstart} {
> ***************
> *** 329,334 ****
> --- 339,350 ----
> yylval.str = litbufdup();
> return BCONST;
> }
> + <xb>{xbconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return BCONST;
> + }
> <xh>{xhinside} |
> <xb>{xbinside} {
> addlit(yytext, yyleng);
> ***************
> *** 356,361 ****
> --- 372,383 ----
> yylval.str = litbufdup();
> return XCONST;
> }
> + <xh>{xhconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return XCONST;
> + }
> <xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); }
>
> {xnstart} {
> ***************
> *** 385,390 ****
> --- 407,418 ----
> yylval.str = litbufdup();
> return SCONST;
> }
> + <xq>{xqconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return SCONST;
> + }
> <xq>{xqdouble} {
> addlitchar('\'');
> }
> ***************
> *** 413,418 ****
> --- 441,447 ----
> BEGIN(xdolq);
> startlit();
> }
> + {dolqfailed} { yyerror("invalid token"); }
> <xdolq>{dolqdelim} {
> if (strcmp(yytext, dolqstart) == 0)
> {
> ***************
> *** 435,440 ****
> --- 464,472 ----
> <xdolq>{dolqinside} {
> addlit(yytext, yyleng);
> }
> + <xdolq>{dolqfailed} {
> + addlit(yytext, yyleng);
> + }
> <xdolq>. {
> /* This is only needed for $ inside the quoted text */
> addlitchar(yytext[0]);
> ***************
> *** 576,582 ****
> yylval.str = pstrdup(yytext);
> return FCONST;
> }
> !
>
> {identifier} {
> const ScanKeyword *keyword;
> --- 608,623 ----
> yylval.str = pstrdup(yytext);
> return FCONST;
> }
> ! {realfail1} {
> ! yyless(yyleng-1);
> ! yylval.str = pstrdup(yytext);
> ! return FCONST;
> ! }
> ! {realfail2} {
> ! yyless(yyleng-2);
> ! yylval.str = pstrdup(yytext);
> ! return FCONST;
> ! }
>
> {identifier} {
> const ScanKeyword *keyword;
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
