Re: benchmarking Flex practices

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: benchmarking Flex practices
Date: 2019-07-20 20:14:25
Message-ID: 18378.1563653665@sss.pgh.pa.us
Lists: pgsql-hackers

John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> writes:
> The pre-existing ecpg var "state_before" was a bit confusing when
> combined with the new var "state_before_quote_stop", and the former is
> also used with C-comments, so I decided to go with
> "state_before_lit_start" and "state_before_lit_stop". Even though
> comments aren't literals, it's less of a stretch than referring to
> quotes. To keep things consistent, I went with the latter var in psql
> and core.

Hm, what do you think of "state_before_str_stop" instead? It seems
to me that both "quote" and "lit" are pretty specific terms, so
maybe we need something a bit vaguer.

> To get the regression tests to pass, I had to add this:
>  psql_scan_in_quote(PsqlScanState state)
>  {
> -    return state->start_state != INITIAL;
> +    return state->start_state != INITIAL &&
> +           state->start_state != xqs;
>  }
> ...otherwise with parens we sometimes don't get the right prompt and
> we get empty lines echoed. Adding xuend and xuchar here didn't seem to
> make a difference. There might be something subtle I'm missing, so I
> thought I'd mention it.

I think you would see a difference if the regression tests had any cases
with blank lines between a Unicode string/ident and the associated
UESCAPE and escape-character literal.

While poking at that, I also came across this unhappiness:

regression=# select u&'foo' uescape 'bogus';
regression'#

that is, psql thinks we're still in a literal at this point. That's
because the uesccharfail rule eats "'b" and then we go to INITIAL
state, so that consuming the last "'" puts us back in a string state.
The backend would have thrown an error before parsing as far as the
incomplete literal, so it doesn't care (or probably not, anyway),
but that's not an option for psql.

My first reaction as to how to fix this was to rip the xuend and
xuchar states out of psql, and let it just lex UESCAPE as an
identifier and the escape-character literal like any other literal.
psql doesn't need to account for the escape character's effect on
the meaning of the Unicode literal, so it doesn't have any need to
lex the sequence as one big token. I think the same is true of ecpg
though I've not looked really closely.

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar. If we did it like that then
we would not need to have this difference between the backend and
frontend lexers, and we'd not have to worry about what
psql_scan_in_quote should do about the whitespace before and after
UESCAPE, either.
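
For illustration only, the deferred step could look something like the
following: the lexer hands back the raw body of the U&'...' literal, and
a later pass applies whatever escape character turned up. This is a
sketch, not a proposed patch. toy_hexval() and toy_udeescape() are
hypothetical names, the code treats each input byte as one code point,
and it does no surrogate-pair or range checking.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not one. */
int
toy_hexval(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Decode the raw body of a U&'...' literal using escape character esc.
 * Handles esc esc (a literal esc), esc XXXX and esc +XXXXXX.
 * Writes code points to out (capacity cap); returns the count, or -1 on
 * a malformed escape or if out is too small.
 */
int
toy_udeescape(const char *raw, char esc, unsigned long *out, size_t cap)
{
    size_t      n = 0;

    assert(raw != NULL && out != NULL);
    while (*raw)
    {
        unsigned long cp;

        if (*raw != esc)
            cp = (unsigned char) *raw++;    /* ordinary character */
        else if (raw[1] == esc)
        {
            cp = (unsigned char) esc;       /* doubled escape character */
            raw += 2;
        }
        else
        {
            int         ndigits = (raw[1] == '+') ? 6 : 4;
            const char *p = raw + ((ndigits == 6) ? 2 : 1);
            int         i;

            cp = 0;
            for (i = 0; i < ndigits; i++)
            {
                int         v = toy_hexval(p[i]);

                if (v < 0)
                    return -1;  /* also stops at a premature end of string */
                cp = cp * 16 + (unsigned long) v;
            }
            raw = p + ndigits;
        }
        if (n >= cap)
            return -1;
        out[n++] = cp;
    }
    return (int) n;
}
```

The point of the sketch is just that nothing in the decoding needs to
happen while lexing: the escape character is an ordinary argument here,
so the lexer would not have to see UESCAPE and its literal as part of
the same token.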

So I'm feeling like maybe we should experiment to see what that
solution looks like, before we commit to going in this direction.
What do you think?

> With the unicode escape rules brought over, the diff to the ecpg
> scanner is much cleaner now. The diff for the C-comment rules was still
> pretty messy in comparison, so I made an attempt to clean that up in
> 0002. A bit off-topic, but I thought I should offer that while it was
> fresh in my head.

I didn't really review this, but it looked like a fairly plausible
change of the same ilk, i.e., combine rules by adding memory of the
previous start state.

regards, tom lane
