Re: WIP fix proposal for bug #6123

From: Florian Pflug <fgp(at)phlo(dot)org>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP fix proposal for bug #6123
Date: 2011-08-02 17:52:27
Message-ID: 37E71EFA-6A6E-40CD-B9D6-CC08B37B697A@phlo.org
Lists: pgsql-hackers

On Aug2, 2011, at 17:03 , Kevin Grittner wrote:
> Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> First, I'm not sure this is even a bug. To me, it seems that
>> postgres currently resolves an inherently ambiguous situation in
>> one possible way, while you expect it to pick another. It might be
>> that the behaviour that you suggest is more in line with people's
>> expectations (More on that below), but that still doesn't make the
>> existing behaviour wrong.
>
> That point would be a hard sell for me. To silently turn a BEFORE
> trigger into an INSTEAD OF trigger invites data corruption, in the
> sense that business rules that you thought were unconditionally
> enforced in triggers can be broken with no obvious hint that it
> happened.

Hm, I can see your point. But I still maintain that a trigger that
acts based on values of OLD which don't reflect the actual tuple
being deleted is an equally dangerous source of data corruption.

>> There might very well be situations in which the current behaviour
>> is preferable over the behaviour you suggest, and maybe these
>> situations turn out to be much more common than anyone anticipates
>> right now.
>
> My imagination is not up to the task of putting any plausible
> situations forward where current behavior is a good thing. Of
> course I realize that's not proof of anything, but it leaves me
> skeptical.

One situation I had in mind was a table whose records contain a
kind of reference count. A cleanup process might then issue a DELETE
which removes all entries whose reference count is zero. Now, if such
a table has a BEFORE DELETE trigger which, through some elaborate
chain of triggers, manages to increment the reference count of a row
that was initially zero, it'd be a bad idea for the DELETE to remove
that row.
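
To make that concrete, here's a rough sketch (the table, column and
trigger names are made up for illustration):

    CREATE TABLE resource (
        id       integer PRIMARY KEY,
        refcount integer NOT NULL DEFAULT 0
    );

    -- Stand-in for the "elaborate chain of triggers": a BEFORE DELETE
    -- trigger that ends up re-referencing the very row being deleted.
    CREATE FUNCTION resource_before_delete() RETURNS trigger AS $$
    BEGIN
        UPDATE resource SET refcount = refcount + 1 WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER resource_guard
        BEFORE DELETE ON resource
        FOR EACH ROW EXECUTE PROCEDURE resource_before_delete();

    -- The cleanup process:
    DELETE FROM resource WHERE refcount = 0;

With the current behaviour the DELETE quietly skips the row whose
reference count was bumped; under the proposed behaviour it would
either be removed anyway or provoke an error.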

But my actual point was not so much that *I* have a reasonable use-case
in mind in which the current behaviour is preferable, but more that
*someone* might.

>> But if we apply the patch now, the chance of that being discovered
>> before 9.1 is released is virtually zero. And after the release
>> there's no going back, so we might end up changing the behaviour
>> to the worse.
>
> Perhaps throwing an error on the DELETE case as well as the UPDATE
> case would address this concern.

It addresses all concerns except this one, I'm afraid. Whatever we
change the behaviour to, I feel that we need to give ourselves plenty
of time to tweak it, should problems arise. Which just isn't possible
before 9.1 is released. Or at least that's my opinion.

> That would have saved weeks of
> staff time here when we started seeing the mysterious behavior which
> currently exists. The only way we finally got to the bottom of it
> was to step through execution in gdb; those users not equipped to do
> so might have bigger problems than we did.

Ouch.

>> And lastly, I believe that your ultimate goal of guaranteeing that
>> a row is actually deleted (updated) after a BEFORE DELETE (BEFORE
>> UPDATE) trigger has fired is actually insurmountable anyway, at
>> least for isolation level READ COMMITTED.
>
> Hmm. That's a very potent point. I had not considered READ
> COMMITTED because we have never, ever used it under PostgreSQL.
> Obviously, any solution has to work for those who *do* use it, too;
> and the behavior between different isolation levels can't be too
> convoluted. So, as you say, the *ultimate* goal may be
> unattainable.

After reading the code again, I take that back. We *do* seem to lock
the tuple before firing BEFORE {UPDATE,DELETE} triggers - trigger.c's
GetTupleForTrigger() takes care of that. I previously missed that because
I had only looked at ExecUpdate() and ExecDelete() in nodeModifyTable.c.
Sorry for that.

>> and I do believe that we ought to improve things - but not by
>> replacing one unsatisfactory behaviour with another.
>
> I hope you're not suggesting *that's* my goal! ;-)

Nah, more of a case of bad wording on my part...

>> The root of the issue seems to be that you're trying to do things
>> in a BEFORE trigger that really belong into an AFTER trigger.
>
> I'll take that as a *collective* pronoun, since I haven't personally
> written any of the thousands of triggers. There are 20 programmers
> doing that. Part of the problem is that they sometimes put things
> in a BEFORE trigger that belong in an AFTER trigger. I caught one
> place they were doing an UPDATE of the target row in a BEFORE
> trigger rather than setting the values in the NEW record. My patch
> makes the latter throw an error, as I believe it should, rather than
> silently leaving the data in a bad state.
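
(As an aside, a minimal illustration of that anti-pattern versus the
intended approach, with made-up table and column names:

    -- Wrong: issuing a fresh UPDATE against the target row from within
    -- its own BEFORE UPDATE trigger.
    CREATE FUNCTION bad_before_update() RETURNS trigger AS $$
    BEGIN
        UPDATE some_table SET modified_at = now() WHERE id = NEW.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Right: simply assign to the NEW record and return it.
    CREATE FUNCTION good_before_update() RETURNS trigger AS $$
    BEGIN
        NEW.modified_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

The first variant is exactly the kind of self-referential modification
that your patch makes throw an error.)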

>> If you could explain why using an AFTER trigger doesn't work for
>> you, maybe we could improve them to be suitable for your purposes
>> (and please forgive me if you already did that).
>
> I did to some extent, but really perhaps the biggest issue (which I
> don't think I've really covered earlier in the thread) is to not
> silently corrupt the data. I would settle for throwing an error on
> the DELETE case as well as the UPDATE case, because my data would
> then be protected, and programmers would be forced to work around
> the issue in a way that PostgreSQL can handle correctly.

Hm, OK, I see your point now, I believe. This is not so much about
wanting to put things into BEFORE triggers that don't really fit
there, but about avoiding weeks of debugging if someone messes up.

> In my case it is mainly an issue in deletes (and the special case of
> a "purge" process where deletes are not normally allowed) doing a
> depth-first search for dependent records, and deleting from the
> bottom up. In some cases there is redundant information in higher
> level tables based on primary data in lower level tables -- in the
> form of counts, sums, or status codes. If you delete a case, and
> that "cascades" to a event history table which has a trigger which
> updates case status according to certain business rules, the delete
> of the case is canceled because of the delete of a status-changing
> event. That's very bad. If we at least threw an error instead, we
> could identify the problem reliably, and code around it somehow.
> That might be by moving the delete of dependent records to a point
> after the parent record has already been deleted, but that runs the
> risk of triggering other validation failures based on business rules
> in the triggers, because the triggers would then be seeing a state
> we try very hard to prevent.

If we go with my suggestion below (which entails throwing the error
earlier, at the time of the offending UPDATE or DELETE), you could
get the non-throwing behaviour you want by wrapping UPDATEs and
DELETEs that might touch rows already in the process of being
updated or deleted in EXCEPTION blocks.
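
Roughly like this (only a sketch; the table and column names are
invented, and it assumes the proposed behaviour in which the offending
statement itself raises a catchable error):

    CREATE FUNCTION child_before_delete() RETURNS trigger AS $$
    BEGIN
        BEGIN
            -- bookkeeping on the parent row, which may itself be in
            -- the middle of being deleted
            UPDATE parent SET child_count = child_count - 1
             WHERE id = OLD.parent_id;
        EXCEPTION WHEN others THEN
            -- under the proposal a dedicated error code would be raised
            -- here; catching it lets us simply skip the bookkeeping
            NULL;
        END;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;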

>> And equally firm do I expect to be the "no" if you ask "do you
>> expect the row that is actually deleted and the contents of the
>> variable OLD in the delete trigger to differ".
>
> I'm less certain about that, especially if you phrase it in terms of
> denormalization optimizations -- the redundant data in the form of
> counts, sums, status codes, or sometimes just copied data from child
> records stored in parents to support better performance on common
> queries. That's the main issue for me,

Yeah, but the problem is we cannot guarantee that all the "important"
fields will match ;-)

> Would you feel comfortable with a patch which threw an error on the
> DELETE case, as it does on the UPDATE case?

Yes, though with a little additional twist. The twist being that I'd
like the error to be thrown earlier, at the time of the offending
UPDATE or DELETE, not later, when the original UPDATE or DELETE which
caused the BEFORE trigger invocation stumbles over the updated row.
That not only seems more logical, it also makes it possible for the
trigger to catch the error and react accordingly.

For example, if a BEFORE trigger on table A modifies table B, and
one of B's BEFORE triggers in turn modifies A, B's BEFORE trigger
could then catch the error and e.g. decide to skip the UPDATE.

Implementation-wise, I imagine we'd set a flag in the tuple header
after locking the tuple (as I've now discovered we do, sorry again)
but before firing BEFORE triggers. We'd clear the flag again
once all BEFORE triggers have run, and let the actual UPDATE
or DELETE proceed. If an UPDATE or DELETE encounters a flagged
tuple, and the transaction (or one of its parents) is the current
lock holder, we'd throw an error. To clean up after aborted
transactions, we'd clear the flag upon acquiring a tuple lock.

Instead of a flag in the tuple header, we could also keep
a (backend-local) list of ctids for which we've fired BEFORE
triggers but haven't done the actual modification yet.

We might also want to make it possible to distinguish pending
UPDATEs from pending DELETEs (i.e., choose a different error code for
the two cases), because it seems reasonable that code would want to
react differently in those cases (e.g., skip an UPDATE if there's a
pending DELETE).
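
Something along these lines, purely hypothetically (the two SQLSTATEs
below are invented placeholders for the yet-to-be-defined error codes,
and the table is made up):

    DO $$
    BEGIN
        UPDATE a SET flag = true WHERE id = 42;
    EXCEPTION
        WHEN SQLSTATE 'TRU01' THEN   -- pending UPDATE on the row
            RAISE;                   -- let the conflict propagate
        WHEN SQLSTATE 'TRD01' THEN   -- pending DELETE on the row
            NULL;                    -- the row is going away, skip our UPDATE
    END;
    $$;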

best regards,
Florian Pflug
