Re: Add CREATE support to event triggers

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add CREATE support to event triggers
Date: 2013-11-12 14:33:18
Message-ID: CA+TgmoZ=vZriJMxLkqi_V0jg4k4LEAPmwUSC6RWXS5MquXUJNA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 8, 2013 at 10:33 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Attached you can find a very-much-WIP patch to add CREATE info support
> for event triggers (normalized commands). This patch builds mainly on
> two things:
>
> 1. Dimitri's "DDL rewrite" patch he submitted way back, in
> http://www.postgresql.org/message-id/m2zk1j9c44.fsf@2ndQuadrant.fr
>
> I borrowed the whole ddl_rewrite.c code, and tweaked it a bit. There
> are several things still wrong with it and which will need to be fixed
> before a final patch can even be contemplated; but there are some
> questions that require a consensus answer before I go and fix it all,
> because what it will look like will depend on said answers.

I'm still unhappy with this whole concept. It adds a significant
maintenance burden that must be carried by everyone who adds new DDL
syntax in the future, and it's easy to imagine this area of the code
ending up very poorly tested and rife with bugs. After all, event
triggers, as nifty as they are, are only going to be used by a small
minority of users. And new DDL features are typically going to be
things that are fairly obscure, so again they will only be used by a
small minority of users. I think we need to avoid the situation where
we have bugs that can only get found when those minorities happen to
intersect. If we're going to have DDL rewrite code, then we need a
way of making sure that it gets tested in a comprehensive way on a
regular basis.

Here's one idea: create a contrib module that (somehow, via APIs to be
invented) runs every DDL command that gets executed through the
deparsing code, and then parses the result and executes *that* instead
of the original command. Then, add a build target that runs the
regression test suite in that mode, and get the buildfarm configured
to run that build target regularly on at least some machines. That
way, adding syntax to the regular regression test suite also serves to
test that the deparsing logic for that syntax is working. If we do
this, there's still some maintenance burden associated with having DDL
deparsing code, but at least our chances of noticing when we've failed
to maintain it should be pretty good.
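
Just to sketch what I have in mind (every API here is invented for
illustration): suppose a pg_event_trigger_get_creation_commands() SRF
that returns the deparsed text, plus a test mode in which the server
suppresses the original execution so the trigger can run the deparsed
version in its place. The trigger half might look like:

    CREATE FUNCTION reexec_deparsed() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        cmd text;
    BEGIN
        -- hypothetical SRF exposing the deparsed command text
        FOR cmd IN SELECT command
                     FROM pg_event_trigger_get_creation_commands()
        LOOP
            -- in the imagined test mode the server skipped the
            -- original execution, so this runs the deparsed
            -- version instead
            EXECUTE cmd;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER reexec_ddl ON ddl_command_end
        EXECUTE PROCEDURE reexec_deparsed();

With that in place, any deparse bug shows up as a regression diff the
first time someone adds the new syntax to the ordinary test suite.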

The other thing that bothers me here is that, while a normalized
command string sounds great in theory, as soon as you want to allow
(for example) mapping schema A on node 1 to schema B on node 2, the
wheels come off: you'll have to parse that normalized command string
so you can change out the schema name and then reassemble it into
a command string again. So we're going to parse the user input, then
deparse it, hand over the results to the application code, which will
then parse it, modify that, and deparse it again. At every step of
that process, any mistake will lead to subtle bugs in the resulting
system. Larry Wall once wrote (approximately) that a good programming
language makes simple things simple and hard things possible; I think
this design fails the second prong of that test.
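
To make the fragility concrete, the obvious trigger-side approach is
a textual rewrite along these lines (the function fetching the command
string is hypothetical), and it goes wrong as soon as the schema name
shows up quoted, inside a string literal, or as a prefix of some other
identifier:

    CREATE FUNCTION map_schema_a_to_b() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        cmd text;
    BEGIN
        -- hypothetical function returning the normalized command
        cmd := pg_event_trigger_get_normalized_command();
        -- naive textual rewrite: map schema a on this node to
        -- schema b on the far node; quoted identifiers, string
        -- literals, and identifiers that merely contain "a." all
        -- break this silently
        cmd := replace(cmd, 'a.', 'b.');
        -- ship the rewritten command to node 2 here
    END;
    $$;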

Now, I guess I can live with that if it's what everyone else wants,
but I don't have a great feeling about the long-term utility of it.
Exposing the data from the DDL statement in some structured way - like
what we've done with drops, or a JSON blob, or something like that,
feels much more robust and useful than a normalized command string to
me. If the normalized command string can easily be built up from that
data, then you don't really need to expose the command string
separately. If it can't, then you're not doing a good job exposing
the data in a usable form. Saying "well, people can always parse the
normalized command string" is a cop-out. Parsing SQL is *hard*; the
only external project I know of that parses our SQL syntax well is
pgpool, and that's because they copy our parser wholesale; that's
surely not the sort of solution we want to foist off on event trigger
authors.
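
To make the structured-data idea concrete, imagine (purely for
illustration, not a proposed format) a ddl_command_end SRF that hands
back something like:

    -- hypothetical SRF; the blob below is invented to show the shape
    SELECT pg_event_trigger_get_created_objects();

    --  {"command": "CREATE TABLE",
    --   "schema":  "public",
    --   "name":    "foo",
    --   "columns": [{"name": "id",  "type": "integer"},
    --               {"name": "val", "type": "text"}]}

A trigger that wants a command string can build one from that, and
the schema-mapping case above becomes a one-field substitution rather
than a parsing problem.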

Finally, I'm very skeptical of the word "normalized". To me, that
sounds like an alias for "modifying the command string in unspecified
ways that big brother thinks will be useful to event trigger authors".
Color me skeptical. What if somebody doesn't want their command
string normalized? What if they want it normalized in a way that's
different from the way that we've chosen to normalize it? I fear that
this whole direction amounts to "we don't know how to design a real
API so let's just do surgery on the command string and call whatever
pops out the API". Maybe that's harsh, but if it is I don't know why.
The normalization steps we build into this process constitute
assumptions about how the feature will be used, and they back the user
into using that feature in just that way and no other.

> 2. The ideas we used to build DROP support. Mainly, the interesting
> thing here is the fact that we use a SRF to report, at
> ddl_command_end, all the objects that were created during execution
> of that command. We do this by collecting them in a list in some raw
> form somewhere during ProcessUtility, and then spitting them out if
> the SRF is called. I think the general idea is sound, although of
> course I admit there might be bugs in the implementation.

Agreed.
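
For reference, the committed drop support already works this way: a
sql_drop trigger calls the pg_event_trigger_dropped_objects() SRF and
sees one row per object the command dropped, e.g.:

    CREATE FUNCTION log_drops() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        obj record;
    BEGIN
        FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects()
        LOOP
            RAISE NOTICE 'dropped % %',
                obj.object_type, obj.object_identity;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER report_drops ON sql_drop
        EXECUTE PROCEDURE log_drops();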

> Note this patch doesn't try to add any kind of ALTER support. I think
> this is fine in principle, because we agreed that we would attack each
> kind of command separately (divide to conquer and all that);

Also agreed.

> but there
> is a slight problem for some kind of objects that are represented partly
> as ALTER state during creation; for example creating a table with a
> sequence uses ALTER SEQ/OWNED BY internally at some point. There might
> be other cases I'm missing, also. (The REFRESH command is nominally
> also supported.)

There are lots of places in the DDL code where we pass around
constructed parse trees as a substitute for real argument lists. I
expect that many of those places will eventually get refactored away,
so it's important that this feature does not end up relying on
accidents of the current code structure. For example, an
AlterTableStmt can actually do a whole bunch of different things in a
single statement: SOME of those are handled by a loop in
ProcessUtilitySlow() and OTHERS are handled internally by AlterTable.
I'm pretty well convinced that that division of labor is a bad design,
and I think it's important that this feature doesn't make that dubious
design decision into documented behavior.
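
The sequence example Alvaro mentions is a good illustration; roughly
speaking, one user-visible command expands internally like this:

    -- one user-visible command:
    CREATE TABLE t (id serial PRIMARY KEY);
    -- internally expands to roughly:
    --   CREATE SEQUENCE t_id_seq;
    --   CREATE TABLE t (id integer NOT NULL
    --                       DEFAULT nextval('t_id_seq'));
    --   ALTER SEQUENCE t_id_seq OWNED BY t.id;
    -- plus the index built for the PRIMARY KEY constraint.  If the
    -- reported commands simply mirror this internal breakdown,
    -- we've documented an implementation detail.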

> Now about the questions I mentioned above:
>
> a) It doesn't work to reverse-parse the statement nodes in all cases;
> there are several unfixable bugs if we only do that. In order to create
> always-correct statements, we need access to the catalogs for the
> created objects. But if we are doing catalog access, then it seems to
> me that we can do away with the statement parse nodes completely and
> just reconstruct the objects from catalog information. Shall we go that
> route?

That works well for CREATE and is definitely appealing in some ways;
it probably means duplicating a whole lot of what's in pg_dump inside
the backend as well. Of course, converting the pg_dump code to a library
that can be linked into either a client or the server would make a lot
of people happy. Making pg_dump much dumber (at least as regards
future versions) and relying on new backend code to serve the same
purpose would perhaps be reasonable as well, although I know Tom is
against it. But having two copies of that code doesn't sound very
good; and we'd need some way of thoroughly testing the one living in
the backend.
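
We do already have piecemeal deparse-from-catalogs functions in the
backend, which hint at what such a library would look like; the gap is
everything pg_dump knows about tables and friends:

    -- deparse-from-catalogs functions that exist today:
    SELECT pg_get_viewdef('pg_stat_activity'::regclass);
    SELECT pg_get_functiondef('pg_sleep(float8)'::regprocedure);
    SELECT pg_get_indexdef('pg_class_oid_index'::regclass);
    -- but there is no pg_get_tabledef(); that knowledge lives only
    -- in pg_dump, which is exactly the duplication problem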

> c) The current patch stashes all objects in a list, whenever there's an
> event trigger function. But perhaps some users want to have event
> triggers and not have any use for the CREATE statements. So one idea is
> to require users that want the objects reported to call a special
> function in a ddl_command_start event trigger which enables collection;
> if it's not called, objects are not reported. This eliminates
> performance impact for people not using it, but then maybe it will be
> surprising for people that call the SRF and find that it always returns
> empty.

This seems like premature optimization to me, but I think I lost the
last iteration of this argument.
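
For what it's worth, if the opt-in design does go in, I assume usage
would look roughly like this (the enabling function is invented):

    CREATE FUNCTION enable_collection() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- hypothetical switch telling the server to stash objects
        -- created by this command for ddl_command_end to report
        PERFORM pg_event_trigger_enable_command_collection();
    END;
    $$;

    CREATE EVENT TRIGGER collect ON ddl_command_start
        EXECUTE PROCEDURE enable_collection();

Anyone who creates only the ddl_command_end trigger then gets the
always-empty result Alvaro describes.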

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
