Re: Remove useless associativity/precedence from parsers

From: Akim Demaille <akim(at)lrde(dot)epita(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Remove useless associativity/precedence from parsers
Date: 2019-05-22 19:20:31
Message-ID: 74DD0F55-F3CF-447E-ACF2-88C01E42AAC3@lrde.epita.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

> Le 21 mai 2019 à 21:06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> a écrit :
>
> Akim Demaille <akim(at)lrde(dot)epita(dot)fr> writes:
>>> Le 20 mai 2019 à 15:54, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> a écrit :
>>> 2013? Certainly not. We have a lot of buildfarm critters running
>>> older platforms than that.
>
>> This I fully understand. However, Bison is a source generator,
>> and it's quite customary to use modern tools on the maintainer
>> side, and then deploy the result them on possibly much older
>> architectures.
>> Usually users of Bison build tarballs with the generated parsers
>> in them, and ship/test from that.
>
> As do we, but at the same time we don't want to make our tool
> requirements too onerous. I think that really the practical limit
> right now is Bison 2.3 --- Apple is still shipping that as their system
> version,

And do not expect Apple to update it at all. Apple refuses the
GPLv3, and stopped updating Bison to the last GPLv2 release,
as it did for every GPL'd program.

> so requiring something newer than 2.3 would put extra burden
> on people doing PG development on Macs (of which there are a lot).

Honestly, I seriously doubt that you have contributors that don't
have MacPorts or Brew installed, and both are pretty up to date on
Bison.

>> So Bison, and your use of it today, is exactly what you need?
>> There's no limitation of that tool that you'd like to see
>> address that would make it a better tool for PostgreSQL?
>
> Well, there are a couple of pain points, but they're not going to be
> addressed by marginal hacking on declarations ;-). The things that
> we find really painful, IMV, are:
>
> * Speed of the generated parser could be better.

Expect news this year about that. I precisely came to look at
PostgreSQL for this. Is there an easy way to bench pg and the various
costs? To be explicit: is there a way to see how long the parsing
phase takes? And some mighty inputs to bench against?

> I suspect this has
> a lot to do with the fact that our grammar is huge, and so are the
> tables, and that causes lots of cache misses. Maybe this could be
> addressed by trying to make the tables smaller and/or laid out in
> a way with better cache locality?

The improvement I have in mind is about LR techniques, not about
this. But you are right that it might be interesting.

It's unlikely that the table can be made smaller though. Bison
was designed when space was really scarce, and a lot of efforts
were invested to make the tables as small as possible. The
current trend is actually to optionally consume more space in
exchange for better services (such as more accurate error messages).

> * Lack of run-time extensibility of the parser. There are many PG
> extensions that wish they could add things into the grammar, and can't.

Making the grammars extensible definitely makes sense, and it's
in the wishlist. But making this doable at runtime is a much bigger
problem...

However, maybe this can be achieved by calling the plugin parser
from the outer parser. Provided, of course, that the grammar of
the plugin is really in a "separate world"; if it also wants to
get bits of the host grammar, it's certainly not so easy.

Are there documented examples of this? What would that look like?

> * LALR(1) parsing can only barely cope with SQL, and the standards
> committee keeps making it harder.

But Bison does more: it also provides support for LR(1) and IELR(1),
which accept more (deterministic) grammars, and are not subject
to "mysterious s/r conflicts" as in LALR. But maybe you refer to
even beyond LR(1):

> We've got some hacks that fake
> an additional token of lookahead in some places, but that's just an
> ugly (and performance-eating) hack.

More than k=1 is unlikely to happen. Given that we have GLR, which
provides us with k=∞ :)

> Maybe Bison's GLR mode would already solve that,

No doubt about that.

Bison's grammar is not LR(1) either, because the rules are not mandated
to end with ';', so when reading a grammar, in a sequence such as
"<ID> <:> <ID>" the parser cannot know whether the second <ID> is the
RHS of the rule introduced by the first <ID>, or the beginning of another
rule if the sequence is actually "<ID> <:> <ID> <:>". Because of that,
Bison is also playing dirty tricks to turn this LR(2) into LR(1).

But as a consequence the grammar is much harder to evolve, the locations
are less accurate (because two tokens are merged together into a mega
token), etc.

So it is considered to turn Bison's own parser to GLR.

> but no one here has really looked into whether
> it could improve matters or whether it'd come at a performance cost.

That should be very easy to check: just adding %glr to the grammar
should not change the API, should not change the visible behavior,
but will give you a hint of the intrinsic cost of the GLR backend.

> The Bison manual's description of GLR doesn't give me a warm feeling
> either about the performance impact

The GLR backend is efficient... for a GLR backend. I have not
benched it against the deterministic backend, but now that you
mention it, it's an obvious need...

> or whether we'd still get
> compile-time warnings about bogus grammars.

Well, you can't fight nature here. Ambiguity of a grammar is
undecidable.

That being said, there's a modified Bison which implements heuristics
to detect ambiguities in grammar [1]. Would that make you feel more
comfortable?

[1] http://www.lsv.fr/~schmitz/pub/expamb.pdf

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-05-22 19:21:53 Re: "long" type is not appropriate for counting tuples
Previous Message Andreas Seltenreich 2019-05-22 19:15:40 Excessive memory usage in multi-statement queries w/ partitioning