Re: WIP: Generic functions for Node types using generated metadata

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Generic functions for Node types using generated metadata
Date: 2019-10-02 20:21:46
Message-ID: 20191002202146.t7kbmdnn4mct54rn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-10-02 14:30:08 -0400, Robert Haas wrote:
> On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I'm afraid that's going to be a deal-breaker for lots of people.
> > It's fine for prototyping the idea but we'll need to find another
> > implementation before we can move to commit.
>
> Why do you think it will be a deal-breaker for lots of people? I mean,
> if we get this to a point where it just requires installing some
> reasonably modern version of LLVM, I don't see why that's worse than
> having to do the same thing for say, Perl if you want to build
> --with-perl, or Python if you want to build --with-python, or bison or
> lex if you want to change the lexer and parser. One more build-time
> dependency shouldn't be a big deal, as long as we don't need a really
> specific version. Or am I missing something?

It shouldn't be that bad. On windows it can just be a binary to install:
http://releases.llvm.org/download.html
and newer versions of visual studio apparently even have GUI support for
installing LLVM + clang + ...

> > > One concern I have is about whether the
> > > code that uses LLVM is likely to be dependent on specific LLVM
> > > versions.

FWIW, after one forward-compatible code change (a convenience function
didn't exist), one configure check (3.7 is older than what we support
for JIT), gennodes.c generated exactly the same output with 3.7 and
the tip-of-tree libclang that I used for development.

I did not check further back, but 3.7 is plenty old.

The libclang API is part of the small set of "stable" APIs (which only
expose C) in LLVM, and the deprecation cycles are fairly long. That
obviously doesn't protect against accidentally adding dependencies on a
feature only available in newer libclang versions, but I don't think
we'd be likely to change the feature set of the node generation program
all that often, and we already have buildfarm animals using most llvm
versions.

> > Yeah, that's one of the reasons it's a deal-breaker. We've been able to
> > insist that everybody touching configure use the same autoconf version,
> > but I don't think that could possibly fly for LLVM. But without that,
> > all the derived files would change in random ways depending on who'd
> > committed last. Even if it's only whitespace changes, that'd be a mess.
>
> I don't really see a reason why that would be an issue here. The code
> Andres wrote just uses LLVM to parse the structure definitions from
> our header files; the code generation stuff is hand-rolled and just
> prints out C. It's basically two big arrays, one of which is indexed
> by NodeTag and thus in a fixed order, and the other of which is an
> array of all structure members of all node types. The latter doesn't
> seem to be sorted in any terribly obvious way at the moment --
> structure members are in order of occurrence within the corresponding
> definition, but the definitions themselves are not in any
> super-obvious order. That could presumably be fixed pretty easily,
> though.

Yea, I don't think there should be any big problem here. The order
already is "consistent" between versions etc, and only depends on the
content of our include files. It's not quite right yet, because adding
a structure member currently can causes a too big diff in the generated
metadata file, due to renumbering of array indexes included in struct
contents of structs.

For the "structure members" array that is probably best solved by simply
removing it, and instead referencing the members directly from within
the node aray. That'll slightly increase the size (pointer, rather than
uint16) of the "node structs" array, and make it harder to iterate over
the members of all structures (uh, why would you want that?), but is
otherwise easy.

The strings table, which is currently "uniqued", is suboptimal for
similar reasons, and also performance (one more lookup in a separate
array). I suspect the best way there too might be to just inline them,
instead of storing them de-duplicated.

Inlining the contents of those would also make it fairly easy to update
the file when doing small changes, if somebody doesn't want to/can't
install libclang. We could probably include a few static asserts (once
we can have them on file scope) in the generated node file, to make
mistakes more apparent.

If we decide that size is enough of an issue, we'd probably have to
output some macros to reduce the amount of visible
re-numbering. So instead of something like

{.name = 2201 /* Result */, .first_field_at = 1857, .num_fields = 2, .size = sizeof(Result)},
we'd have
{.name = TI_STRINGOFF_RESULT, .first_field_at = TI_FIRST_FIELD_RESULT, .num_fields = 2, .size = sizeof(Result)},

#define TI_STRINGOFF_RESULT (TI_STRINGOFF_RESCONSTQUAL + 1)
#define TI_FIRST_FIELD_RESULT (TI_FIRST_FIELD_PLAN + 15)

(with a bit more smarts for how to name the string #defines in a
non-conflicting way)

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-10-02 20:46:30 Re: WIP: Generic functions for Node types using generated metadata
Previous Message Magnus Hagander 2019-10-02 18:59:27 Re: Online checksums patch - once again