Speeding up the Postgres lexer

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Speeding up the Postgres lexer
Date: 2005-05-23 16:31:35
Message-ID: 8652.1116865895@sss.pgh.pa.us

I was doing some profiles recently that showed that on simple statements
(like INSERT with a few column values) the basic parser (flex/bison) was
taking up a noticeable percentage of the total CPU time. We have already
done some things to speed up the lexer, such as using the -CF option for
large-but-fast tables. I read in the flex manual about how you can make
it faster yet if you can avoid the need for backup states. It's not
terribly well documented, but I eventually worked out how to do it, and
find that it seems to make the lexer about a third faster. Basically
what you have to do is provide alternate rules that allow the lexer to
match *some* rule on any prefix of a complete token. For instance the
rule for {real} implies that given
123.456E+
if the next character is not a digit, then the {real} rule fails, and
the lexer has to backtrack to
123.456
which it can recognize as a {decimal}. You can add a rule that does
match in this situation, throws back the two unwanted characters with
yyless(), and returns the number as the {decimal} rule would do.
The next call will carry on lexing the "E" as an identifier, etc.
(Note: this is going to end up being a syntax error because the SQL
grammar never allows a number to immediately precede an identifier;
but that's not the lexer's concern, and so I don't think that we should
make the extra rule throw an error immediately.)
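To illustrate (this is not part of the patch), here is a toy Python model of
the number rules. The regexes are transcribed from the patch, but the driver
only imitates flex's longest-match / earliest-rule behavior, not its real DFA.
It shows how the added realfail rules plus yyless() mean the scanner itself
never has to back up:

```python
import re

# Mini-model of the scanner's number rules (regexes transcribed from the
# patch; the driver below imitates flex's matching policy, not its DFA).
NUM = r'(?:\d+\.\d*|\.\d+|\d+)'     # roughly {integer}|{decimal}
RULES = [
    ("real",      re.compile(NUM + r'[Ee][-+]?\d+')),
    ("realfail1", re.compile(NUM + r'[Ee]')),
    ("realfail2", re.compile(NUM + r'[Ee][-+]')),
    ("decimal",   re.compile(r'\d*\.\d+|\d+\.\d*')),
    ("integer",   re.compile(r'\d+')),
]

def next_token(buf):
    """Longest match wins (earlier rule on ties, as in flex).  Then
    realfail1/realfail2 apply the yyless() trick: throw back the dangling
    E or E+/- and report just the leading number, so some rule always
    matches every prefix and no backup state is needed."""
    cands = [(n, m.group(0)) for n, r in RULES if (m := r.match(buf))]
    name, text = max(cands, key=lambda p: len(p[1]))
    if name == "realfail1":
        text = text[:-1]            # yyless(yyleng - 1)
    elif name == "realfail2":
        text = text[:-2]            # yyless(yyleng - 2)
    kind = "ICONST" if name == "integer" else "FCONST"
    return kind, text, buf[len(text):]
```

On "123.456E+" this returns FCONST "123.456" with "E+" left in the buffer,
which the next calls will lex as an identifier, exactly as described above.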

What I'm wondering is whether this is really worth doing or not.
There are currently just two parts of the lexer rules that are affected
--- the {real} rule illustrated above, and the rules that allow quoted
strings to be split across lines as the SQL spec requires. But the
patches are still pretty ugly, and what's really annoying is that there
doesn't seem to be any way to get flex to complain if someone later
makes a change that breaks the no-backup-cases property again.
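To make the quoted-string case concrete, here is a similar toy model (again
my reconstruction, not the patch itself). Since {whitespace} in scan.l can
match a "--" comment, a lone "-" after a closing quote leaves a longest-match
scanner mid-pattern: it has consumed the "-" hoping for a second "-", and
must back up if none comes. The added xqconfused rule matches that prefix
outright and gives the "-" back with yyless():

```python
import re

# Toy model of the end-of-quoted-string rules.  WS stands in for scan.l's
# {whitespace}: plain whitespace or a complete "--" comment.
WS = r"(?:[ \t\r\n]+|--[^\n]*\n)"

XQ_RULES = [
    # closing quote plus trailing whitespace (the patched xqstop)
    ("xqstop",     re.compile(r"'" + WS + r"*")),
    # same, but ending at a bare "-" (the added xqconfused rule)
    ("xqconfused", re.compile(r"'" + WS + r"*-")),
]

def close_string(buf):
    """Return (token, unconsumed input) for the text after a string body."""
    cands = [(n, m.group(0)) for n, r in XQ_RULES if (m := r.match(buf))]
    name, text = max(cands, key=lambda p: len(p[1]))
    if name == "xqconfused":
        text = text[:-1]            # yyless(yyleng - 1): give the "-" back
    return "SCONST", buf[len(text):]
```

A complete "-- comment\n" after the quote is still swallowed as whitespace
(xqstop matches longer); only the ambiguous lone "-" takes the xqconfused
path and is re-lexed, presumably as an operator.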

A prototype patch (completely undocumented :-() is attached. Any
thoughts about whether it's worth pressing forward with?

regards, tom lane

*** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005
--- src/backend/parser/scan.l Mon May 23 01:05:54 2005
***************
*** 148,163 ****
* validate the contents.
*/
xbstart [bB]{quote}
! xbstop {quote}
xbinside [^']*
xbcat {quote}{whitespace_with_newline}{quote}

/* Hexadecimal number
*/
xhstart [xX]{quote}
! xhstop {quote}
xhinside [^']*
xhcat {quote}{whitespace_with_newline}{quote}

/* National character
*/
--- 148,165 ----
* validate the contents.
*/
xbstart [bB]{quote}
! xbstop {quote}{whitespace}*
xbinside [^']*
xbcat {quote}{whitespace_with_newline}{quote}
+ xbconfused {quote}{whitespace}*"-"

/* Hexadecimal number
*/
xhstart [xX]{quote}
! xhstop {quote}{whitespace}*
xhinside [^']*
xhcat {quote}{whitespace_with_newline}{quote}
+ xhconfused {quote}{whitespace}*"-"

/* National character
*/
***************
*** 169,180 ****
*/
quote '
xqstart {quote}
! xqstop {quote}
xqdouble {quote}{quote}
xqinside [^\\']+
xqescape [\\][^0-7]
xqoctesc [\\][0-7]{1,3}
xqcat {quote}{whitespace_with_newline}{quote}

/* $foo$ style quotes ("dollar quoting")
* The quoted string starts with $foo$ where "foo" is an optional string
--- 171,183 ----
*/
quote '
xqstart {quote}
! xqstop {quote}{whitespace}*
xqdouble {quote}{quote}
xqinside [^\\']+
xqescape [\\][^0-7]
xqoctesc [\\][0-7]{1,3}
xqcat {quote}{whitespace_with_newline}{quote}
+ xqconfused {quote}{whitespace}*"-"

/* $foo$ style quotes ("dollar quoting")
* The quoted string starts with $foo$ where "foo" is an optional string
***************
*** 185,190 ****
--- 188,194 ----
dolq_start [A-Za-z\200-\377_]
dolq_cont [A-Za-z\200-\377_0-9]
dolqdelim \$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed \${dolq_start}{dolq_cont}*
dolqinside [^$]+

/* Double quote
***************
*** 247,253 ****

integer {digit}+
decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))

param \${integer}

--- 251,259 ----

integer {digit}+
decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real ({integer}|{decimal})[Ee][-+]?{digit}+
! realfail1 ({integer}|{decimal})[Ee]
! realfail2 ({integer}|{decimal})[Ee][-+]

param \${integer}

***************
*** 310,315 ****
--- 316,325 ----
/* ignore */
}

+ <xc>\*+ {
+ /* ignore */
+ }
+
<xc><<EOF>> { yyerror("unterminated /* comment"); }

{xbstart} {
***************
*** 329,334 ****
--- 339,350 ----
yylval.str = litbufdup();
return BCONST;
}
+ <xb>{xbconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return BCONST;
+ }
<xh>{xhinside} |
<xb>{xbinside} {
addlit(yytext, yyleng);
***************
*** 356,361 ****
--- 372,383 ----
yylval.str = litbufdup();
return XCONST;
}
+ <xh>{xhconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return XCONST;
+ }
<xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); }

{xnstart} {
***************
*** 385,390 ****
--- 407,418 ----
yylval.str = litbufdup();
return SCONST;
}
+ <xq>{xqconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return SCONST;
+ }
<xq>{xqdouble} {
addlitchar('\'');
}
***************
*** 413,418 ****
--- 441,447 ----
BEGIN(xdolq);
startlit();
}
+ {dolqfailed} { yyerror("invalid token"); }
<xdolq>{dolqdelim} {
if (strcmp(yytext, dolqstart) == 0)
{
***************
*** 435,440 ****
--- 464,472 ----
<xdolq>{dolqinside} {
addlit(yytext, yyleng);
}
+ <xdolq>{dolqfailed} {
+ addlit(yytext, yyleng);
+ }
<xdolq>. {
/* This is only needed for $ inside the quoted text */
addlitchar(yytext[0]);
***************
*** 576,582 ****
yylval.str = pstrdup(yytext);
return FCONST;
}
!

{identifier} {
const ScanKeyword *keyword;
--- 608,623 ----
yylval.str = pstrdup(yytext);
return FCONST;
}
! {realfail1} {
! yyless(yyleng-1);
! yylval.str = pstrdup(yytext);
! return FCONST;
! }
! {realfail2} {
! yyless(yyleng-2);
! yylval.str = pstrdup(yytext);
! return FCONST;
! }

{identifier} {
const ScanKeyword *keyword;
