Quick Links

WIP: Generic functions for Node types using generated metadata

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	WIP: Generic functions for Node types using generated metadata
Date:	2019-08-28 23:41:36
Message-ID:	20190828234136.fk2ndqtld3onfrrp@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

There's been a lot of complaints over the years about how annoying it is
to keep the out/read/copy/equalfuncs.c functions in sync with the actual
underlying structs.

There've been various calls for automating their generation, but no
actual patches that I am aware of.

There also recently has been discussion about generating more efficient
memory layout for node trees that we know are read only (e.g. plan trees
inside the plancache), and about copying such trees more efficiently
(e.g. by having one big allocation, and then just adjusting pointers).

One way to approach this problem would be to to parse the type
definitions, and directly generate code for the various functions. But
that does mean that such a code-generator needs to be expanded for each
such functions.

An alternative approach is to have a parser of the node definitions that
doesn't generate code directly, but instead generates metadata. And then
use that metadata to write node aware functions. This seems more
promising to me.

In the attached, *very experimental WIP*, patch I've played around with
this approach. I have written a small C program that uses libclang
(more on that later) to parse relevant headers, and generate metadata
about node types.

Currently the generated metadata roughly looks like this:

#define TYPE_CAT_SCALAR (1 << 0) // FIXME: this isn't well named rn
#define TYPE_CAT_POINTER (1 << 1)
#define TYPE_CAT_UNION (1 << 2) // FIXME: unused
#define TYPE_CAT_INCOMPLETE (1 << 3)

#define KNOWN_TYPE_UNKNOWN 0
#define KNOWN_TYPE_STRING 1
#define KNOWN_TYPE_INT 2
#define KNOWN_TYPE_NODE 3
#define KNOWN_TYPE_BITMAPSET 4

typedef struct NodeTypeInfo
{
/* FIXME: intern me */
const char *structname;
int16 nelems;
int16 start_at;
int16 size;
} NodeTypeInfo;

typedef struct NodeTypeComponents
{
/* FIXME: intern me */
const char *fieldname;
/* FIXME: intern me */
const char *fieldtype;
int16 offset;
int16 size;
int16 flags;
int16 node_type_id;
int16 known_type_id;
} NodeTypeComponents;

NodeTypeInfo node_type_info[] = {
{0},
{.structname = "IndexInfo", .nelems = 22, .start_at = 2128, .size = sizeof(IndexInfo)},
...
{.structname = "Append", .nelems = 4, .start_at = 1878, .size = sizeof(Append)},
...

NodeTypeComponents node_type_components[] = {
...
{.fieldname = "plan", .fieldtype = "struct Plan", .offset = offsetof(struct Append, plan), .size = sizeof(struct Plan), .flags = TYPE_CAT_SCALAR, .node_type_id = 9, .known_type_id = KNOWN_TYPE_UNKNOWN},
{.fieldname = "appendplans", .fieldtype = "struct List *", .offset = offsetof(struct Append, appendplans), .size = sizeof(struct List *), .flags = TYPE_CAT_POINTER, .node_type_id = 223, .known_type_id = KNOWN_TYPE_UNKNOWN},
{.fieldname = "first_partial_plan", .fieldtype = "int", .offset = offsetof(struct Append, first_partial_plan), .size = sizeof(int), .flags = TYPE_CAT_SCALAR, .node_type_id = -1, .known_type_id = KNOWN_TYPE_UNKNOWN},
{.fieldname = "part_prune_info", .fieldtype = "struct PartitionPruneInfo *", .offset = offsetof(struct Append, part_prune_info), .size = sizeof(struct PartitionPruneInfo *), .flags = TYPE_CAT_POINTER, .node_type_id = 53, .known_type_id = KNOWN_TYPE_UNKNOWN},
...

i.e. each node type is listed in node_type_info, indexed by the NodeTag
value. The fields building up the node are defined in
node_type_components. By including the size for Node structs, and
components, and by including the offset for components, we can write
generic routines accessing node trees.

In the attached I used this metadata to write a function that can
compute the amount of memory a node tree needs (without currently
considering memory allocator overhead). The function for doing so has a
pretty limited amount of type awareness - it needs to know about
strings, the various list types, value nodes and Bitmapset.

I'm fairly sure this metadata can also be used to write the other
currently existing node functions.

I *suspect* that it'll be quite OK performancewise, compared to bespoke
code, but I have *NOT* measured that.

The one set of fields this currently can not deal with is the various
arrays that we directly reference from nodes. For e.g.

typedef struct Sort
{
Plan plan;
int numCols; /* number of sort-key columns */
AttrNumber *sortColIdx; /* their indexes in the target list */
Oid *sortOperators; /* OIDs of operators to sort them by */
Oid *collations; /* OIDs of collations */
bool *nullsFirst; /* NULLS FIRST/LAST directions */
} Sort;

the generic code has no way of knowing that sortColIdx, sortOperators,
collations, nullsFirst are all numCols long arrays.

I can see various ways of dealing with that:

1) We could convert them all to lists, now that we have fast arbitrary
access. But that'd add a lot of indirection / separate allocations.

2) We could add annotations to the sourcecode, to declare that
association. That's probably not trivial, but wouldn't be that hard -
one disadvantage is that we probably couldn't use that declaration
for automated asserts etc.

3) We could introduce a few macros to create array type that include the
length of the members. That'd duplicate the lenght for each of those
arrays, but I have a bit of a hard time believing that that's a
meaningful enough overhead.

I'm thinking of a macro that'd be used like
ARRAY_FIELD(AttrNumber) *sortColIdx;
that'd generate code like
struct
{
size_t nmembers;
AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
} *sortColIdx;

plus a set of macros (or perhaps inline functions + macros) to access
them.

With any of these we could add enough metadata for node type members to
be able to handle such arrays in a generic manner, rather than having to
write code for each of them.

With regards to using libclang for the parsing: I chose that because it
seemed the easiest to experiment with, compared to annotating all the
structs with enough metadata to be able to easily parse them from a perl
script. The node definitions are after all distributed over quite a few
headers.

I think it might even be the correct way forward, over inventing our own
mini-languages and writing ad-hoc parsers for those. It sure is easier
to understand plain C code, compared to having to understand various
embeded mini-languages consisting out of macros.

The obvious drawback is that it'd require more people to install
libclang - a significant imposition. I think it might be OK if we only
required it to be installed when changing node types (although it's not
necessarily trivial how to detect that node types haven't changed, if we
wanted to allow other code changes in the relevant headers), and stored
the generated code in the repository.

Alternatively we could annotate the code enough to be able to write our
own parser, or use some other C parser.

I don't really want to invest significantly more time into this without
first debating the general idea.

The attached patch really just is a fairly minimal prototype. It does
not:
- properly integrate into the buildsystem, to automatically re-generate
the node information when necessary - instead one has to
"make -C src/backend/catalog gennodes-run"
- solve the full problem (see discussion of referenced arrays above)
- actually use the metadata in a useful manner (there just is a function
that estimates node tree sizes, which is used for parse/plan trees)
- have clean code

One interesting - but less important - bit is that the patch currently
has the declared node type id available for node members, e.g:

typedef struct FieldStore
{
Expr xpr;
Expr *arg; /* input tuple value */
List *newvals; /* new value(s) for field(s) */
List *fieldnums; /* integer list of field attnums */
Oid resulttype; /* type of result (same as type of arg) */
/* Like RowExpr, we deliberately omit a typmod and collation here */
} FieldStore;

but we cannot actually rely on arg actually pointing to a node of type
Expr* - it'll always point to a "subclass". Obviously we can handle
that by dispatching using nodeTag(arg), but it'd be more efficient if we
didn't need that. And we also loose some error checking due to that.

I wonder if we could collect enough metadata to determine which
potential subclasses a node type has. And then dispatch more efficiently
when only one, and assert that it's a legal subclass for the other
cases.

Greetings,

Andres Freund

Attachment	Content-Type	Size
gennode.diff	text/x-diff	673.5 KB

Responses

Re: WIP: Generic functions for Node types using generated metadata at 2019-08-30 12:35:34 from Fabien COELHO
Re: WIP: Generic functions for Node types using generated metadata at 2019-09-20 05:18:57 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2019-08-29 00:09:05	Re: REINDEX filtering in the backend
Previous Message	Ryan Lambert	2019-08-28 21:56:54	Re: FETCH FIRST clause PERCENT option