Re: hint infrastructure setup (v3)

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-05 09:12:53
Message-ID: Pine.LNX.4.58.0404050954430.13384@sablons.cri.ensmp.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches


Dear Tom,

First, thanks for the discussion about my "hint" infrastructure patch
submission. Whatever the opinion, it helps.

> We'd have to write our own version of bison's verbose-error code anyway,
> because the canned code doesn't support localization --- it uses
> hardwired strings for "expecting" and so on. But it looks possibly
> doable to me.

I also played a little bit with the parser over the weekend. I found the
same issues that you noted in your mail, tried more options, and found
other problems. I sent here the full mail I wrote.

Although I agree that it is "doable", I have stronger reserve than yours.
Also, I do not find it an appealing solution to change "gram.c" a lot.

The problems I found are:

(1) not available raw information.

The automaton stack keeps states, which are not directly linked to rules
and terminals. The terminals are not available, they must be kept
separatly if you want them. This can be done in yylex().

The internal state, stack, token... are local variables within yyparse().
As a result, they are not accessible from yyerror. I haven't found any
available hook, so you have to hack "gram.c" to get this information. It
is a 1 line hack, but it modifies the generated code. Or you have to
produce a new hacked "gram.c", which is what you suggest.

(2) hard to compute follow set

The information about the syntax is there, obviously! However, from a
given automaton state, the actually allowed tokens are not directly
available. Only some are directly available, so the result is:

DROP ? ;
ERROR: syntax error at or near "?" at character 6
HINT: hint status is state=422 n=796 char=575 type=320=Op stack= 0 16 422
tokens= 88:DROP 320:Op
expected= 150:LANGUAGE?

Other tokens may be found, but you have either:

- to *simulate* the automaton forward through reductions to check whether
it is finally accepted. Not really beautiful, but that can be done.
It is just the bison code without the reduction production rules;-)

- move backwards before doing the above, if some reductions where
performed because of the submitted token and finally resulted in the error,
the state that lead to the error may not be the highest available one, so
maybe other allowed tokens may also be missed. We would need to have
the last state before any reduction.
Another small hack in generated "gram.c" would be needed to get it.

However,

(3) the computed follow set is often useless, as you noted.

The reason for that is that keywords are accepted in place of identifiers.
The use of rules: reserved_keyword, func_name_keyword, col_name_keyword,
unreserved_keyword as possible column/function/... identifiers is IMHO the
main reason for the size of the generated automaton.

If you drop that, everything would be much simpler, because it would
reduce a lot the number of arcs in the automaton. In particular, anything
which is just an identifier should not be given a name (say, BOOLEAN_P or
SMALLINT should not be tokens, but could rather be recognized as a special
case of identifier from within the parser, maybe with some simple
post-filtering).

As you noted, for things like "SHOW 123", the follow set basically
includes all keywords although you can have SHOW ALL or SHOW name.

So, as you suggested, you can either say "ident" as a simplification, but
you miss ALL which is meaningful, or you say all keywords, which is
useless.

An alternative is to know that you're within a "SHOW something", that is
you somehow reparse the code from within the syntax error:-( It is
necessary if you want to say something more interesting than "ident", say
"option name". You may also get this information by simulating the
automaton forward, or noticing that some of the current on stack states
can only lead to the reduction of a ShowStmt...

Well, on the positive side, it is sometimes right, say:

REVOKE ? ;
ERROR: syntax error, unexpected Op at or near "?" at character 8
HINT: hint status is state=481 n=1297 char=575 type=320=Op
stack= 0 31 481
tokens= 234:REVOKE 320:Op
expected= 10:ALL 59:CREATE 80:DELETE_P 98:EXECUTE 136:INSERT 224:REFERENCES
239:RULE 244:SELECT 268:TEMP 270:TEMPORARY 279:TRIGGER 292:UPDATE
293:USAGE?

(4) the available automaton data are inverted with respect to what would
be needed... It would be nice to know whether we're heading for a create
statement or something like that, but the information is hidden down the
automaton. basically, the automaton would need to be "inverted" to compute
this information. We would need the initial information in "gram.y", which
was "inverted" to build the automaton. So the code somehow needs to undo
what bison compilation has done.

(5) anything that can be done would be hardwired to one version of bison.
There is a lot of asumptions in the code and data structures, and any
newer/older version with some different internal representation would
basically break any code that would rely on that. So postgres would not be
"bison" portable:-( I don't think it is an real option that old postgresql
source would be broken against future bison releases.

So, as far as I'm concerned, I don't find the internal way really
convincing as the way to go. Thus I can see three options:

(a) put hints in "gram.y" as I already suggested
+ I can do it
+ it can be incremental once the infrastructure is in place.
+ hints are simple to add or fix
= hints are not necessarily marvellous
- it would take time to develop
- it may slow down the parser, because of added production rules.
- it would change "gram.y" a lot and could make it harder to maintain
- patchers don't like it...

(b) write a new "recursive descendant" parser, and drop "gram.y"
+ I could do it
+ hints could be more intelligent
+ I think the parser would be faster, at least not slower
= I'm not sure it would be really easier to maintain, but maybe not harder
- it cannot be incremental at all
- hints are at the price of some programming
- it would take time to develop, but basically the structure would
look like "gram.y" a lot, and existing data structures can be kept.
- I'm not sure patchers would like it, and if it is thrown down the drain,
I would not be happy for the one who spent the time (say, me;-)

(c) do nothing.
+ I can do that quite easily;-)
+ patchers are ok with that
+ it does not take time
= the maintainability of "gram.y" is not changed.
- my students won't have hints.

A funny technical detail is that the refactoring which is needed for the
(a) option would be also needed for (b). (b) would require more time than
(a), but the results could be better for the hint/parser error handling.

I can have some time for this, especially over the next few months.
Incremental options looked a better way to me, but (a) seems stuck and (b)
is risky, and the internal way looks ugly.

As a side effect of my inspection is that the automaton generated by bison
could be simplified if a different tradeoff between the lexer, the parser
and the post-processing would be chosen. Namelly, all tokens that are
just identifiers should be dropped and processed differently.

Have a nice day,

--
Fabien Coelho - coelho(at)cri(dot)ensmp(dot)fr

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Peter Eisentraut 2004-04-05 09:15:18 Re: Translation Upadtes for 7.5: initdb-ru.po.gz; libpq-ru.po.gz; psql-ru.po.gz
Previous Message Claudio Natoli 2004-04-05 09:05:06 Re: MSFT compiler fixes + misc