Re: benchmarking Flex practices

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: benchmarking Flex practices
Date: 2019-07-24 07:45:45
Message-ID: CACPNZCs1f3YfT4WGNGJOfJxQwHiJL+bj_0yzRyFyuWM=3E=q1g@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 21, 2019 at 3:14 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> writes:
> > The pre-existing ecpg var "state_before" was a bit confusing when
> > combined with the new var "state_before_quote_stop", and the former is
> > also used with C-comments, so I decided to go with
> > "state_before_lit_start" and "state_before_lit_stop". Even though
> > comments aren't literals, it's less of a stretch than referring to
> > quotes. To keep things consistent, I went with the latter var in psql
> > and core.
>
> Hm, what do you think of "state_before_str_stop" instead? It seems
> to me that both "quote" and "lit" are pretty specific terms, so
> maybe we need something a bit vaguer.

Sounds fine to me.

> While poking at that, I also came across this unhappiness:
>
> regression=# select u&'foo' uescape 'bogus';
> regression'#
>
> that is, psql thinks we're still in a literal at this point. That's
> because the uesccharfail rule eats "'b" and then we go to INITIAL
> state, so that consuming the last "'" puts us back in a string state.
> The backend would have thrown an error before parsing as far as the
> incomplete literal, so it doesn't care (or probably not, anyway),
> but that's not an option for psql.
>
> My first reaction as to how to fix this was to rip the xuend and
> xuchar states out of psql, and let it just lex UESCAPE as an
> identifier and the escape-character literal like any other literal.
> psql doesn't need to account for the escape character's effect on
> the meaning of the Unicode literal, so it doesn't have any need to
> lex the sequence as one big token. I think the same is true of ecpg
> though I've not looked really closely.
>
> However, my second reaction was that maybe you were on to something
> upthread when you speculated about postponing de-escaping of
> Unicode literals into the grammar. If we did it like that then
> we would not need to have this difference between the backend and
> frontend lexers, and we'd not have to worry about what
> psql_scan_in_quote should do about the whitespace before and after
> UESCAPE, either.
>
> So I'm feeling like maybe we should experiment to see what that
> solution looks like, before we commit to going in this direction.
> What do you think?

Given the above wrinkles, I thought it was worth trying. Attached is a
rough patch (don't mind the #include mess yet :-) ) that works like
this:

The lexer returns UCONST from xus and UIDENT from xui. The grammar has
rules that are effectively:

SCONST { do nothing }
| UCONST { esc char is backslash }
| UCONST UESCAPE SCONST { esc char is from $3 }

...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
conflicts, I added UIDENT to the %nonassoc precedence list to match
IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
there's a more principled way. I also added an unsigned char type to
the %union, but it worked fine on my compiler without it.

litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
former had to be massaged to give error messages similar to HEAD's;
they're not quite identical, but the position info is preserved. Some of the
functions I moved around don't seem to have any test coverage, so I
should eventually do some work in that regard.

Notes:

-Binary size is very close to v6. That is to say the grammar tables
grew by about the same amount the scanner table shrank, so the binary
is still about 200kB smaller than HEAD.
-Performance is very close to v6 with the information_schema and
pgbench-like queries with standard strings, which is to say also very
close to HEAD. When the latter was changed to use Unicode escapes,
however, it was about 15% slower than HEAD. That's a big regression
and I haven't tried to pinpoint why.
-psql was changed to follow suit. It no longer thinks it's inside a
string with your too-long escape char above, and it removes all blank
lines when echoing this query:

$ cat >> test-uesc-lit.sql
SELECT

u&'!0041'

uescape

'!'

as col
;

On HEAD and v6 I get this:

$ ./inst/bin/psql -a -f test-uesc-lit.sql

SELECT
u&'!0041'

uescape
'!'
as col
;
col
-----
A
(1 row)

-The ecpg changes here are only the bare minimum from HEAD to get it
to compile, since I'm borrowing its additional token names (although
they mean slightly different things). After some experimentation,
it's clear that more work is needed to get it functional, and
it's not easy to debug, so I'm putting that off until we decide
whether this is the way forward.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v7-draft-handle-uescapes-in-parser.patch application/octet-stream 39.9 KB
