Re: [HACKERS] Think we need major revisions in async.c...

From: Massimo Dal Zotto <dz(at)cs(dot)unitn(dot)it>
To: hackers(at)postgreSQL(dot)org (PostgreSQL Hackers)
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane)
Subject: Re: [HACKERS] Think we need major revisions in async.c...
Date: 1998-09-28 16:39:47
Message-ID: 199809281639.SAA01310@boogie.cs.unitn.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> Well, I've spent the last several hours studying how NOTIFY works, and
> I think it's a mess :-(. We need some significant revisions in there.

I agreee. I tried to improve it but it's really a mess.

> I have not been able to trace an exact flow of control that would
> yield the NOTIFY-message-interspersed-in-data result that I saw this
> afternoon. However, I am pretty certain that I know the basic cause
> of the problem. The risky aspect of NOTIFY is that very substantial
> amounts of processing (an entire transaction, in fact) are launched from
> a signal handler routine. As a rule of thumb, doing more than setting a
> flag variable in a signal handler is a big NO NO, because you don't have
> any idea what bit of code you've interrupted or what you can safely
> call. (I have a strong suspicion, in fact, that the behavior I saw
> arose from the signal interrupt occuring during the backend's fflush()

I suspected this could happen but don't really know how signals are handled.

> of data destined for the frontend at the end of a query cycle. I dunno
> how re-entrant the stdio library is on my machine, but it seems quite
> possible that it behaves undesirably if one attempts to queue more
> output data from a routine that interrupts fflush()...)
>
> The Postgres code tries to protect itself against this sort of problem
> by having the SIGUSR2 handler not do anything except set a flag if the
> current transaction state says that a transaction is in progress. This
> is almost good enough, but evidently not quite. The first issue is that
> some query initialization and finalization code in PostgresMain happens
> outside the StartTransactionCommand...CommitTransactionCommand zone of
> safety. (In particular, the final fflush of data bound for the
> frontend.) The second issue is that async.c itself sometimes starts and
> ends transactions, by calling xact.c, and yet xact.c also calls async.c
> as part of transaction commit. At least two levels of recursion are
> possible in there. I have not convinced myself whether the bug I

I believe that the transaction code can avoid this type of recursion.

> encountered is due to some subtly incorrect ordering of events in that
> interaction, or whether it's just due to non-reentrancy in stdio.
> But I no longer care. This code is demonstrably buggy, and it is too
> complex to understand or maintain anyway. I propose to rewrite it.
>
> I propose that:
>
> 1. The link between inbound-notify processing and transaction
> start/commit should be removed. Having inbound-notify execution called
> from transaction commit is too complicated because that notify execution
> itself may have to induce a transaction start/commit. Instead, inbound
> notify execution ought to be triggered from the PostgresMain main loop.

The reason why it is done at commit is that it is the best place where we
can know if the transaction completed or not. If we move the notify code
in the main loop we must also keep some flag which tell us if and when the
notify can be done.

> This has the additional advantage that we can have more control over
> just which parts of the main loop allow or disallow notify execution
> directly from the signal handler. The main loop should now contain
> something like:
>
> for (;;) {
>
> ReadyForQuery(); // contains fflush() call
>
> AllowNotifyInterrupt();
>
> ReadCommand();
>
> DisableNotifyInterrupt();
>
> execute command, including transaction start/commit
> }
>
> so that the SIGUSR2 interrupt is allowed to do something serious
> only in the *very* tiny window where we are just waiting for or
> reading the next frontend query. (Since the interrupt has no
> reason to touch the state of stdin, this should be safe enough.)
> An interrupt occuring any other time will just set a flag, which
> will be read by AllowNotifyInterrupt --- if an interrupt has occured
> since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will
> carry out the necessary processing. Thus, all inbound-notify execution
> effectively happens at the outer loop --- either while waiting for
> frontend input, or just before doing so.
>
>
> 2. The innards of NOTIFY execution are unnecessarily complicated and
> inefficient, quite aside from the above problem of controlling interrupt
> safety. The NOTIFY statement itself (routine Async_Notify) makes one
> pass over the pg_listener relation, writing a 1 into the notification
> field of each tuple with a matching relname. Then at commit time, we
> have to make another pass over the pg_listener relation to issue signals
> (and/or send a message to our own front end, if we're NOTIFYing
> something we also LISTEN on, which is pretty common in my applications
> anyway). This is unnecessary processing; worse, if the NOTIFY is done
> early in a long transaction block, we end up holding a write lock on
> pg_listen for a long time. Since pg_listen is a global resource, this
> can effectively bring all the rest of the backends to a halt!

This is the standard behaviour of the notify code. If you set the flag
notifyunlock in .../data/pg_options the write lock is released immediately
after updating the table, but I suspect this is the reason of the
duplicate oids. For this reason I left the flag disabled by default, and
maybe this explains why you don't get duplicate oids in 6.4, but I have
enabled it in a production environment without problems at all.

> The right way to do this is for NOTIFY to simply add the target relation
> name to a list of pending notifies (as indeed it does already). Then
> when we are ready to commit, we scan pg_listen *once*, performing tuple
> updates as needed and also sending signals and/or frontend messages.
> (Note this is for an *outbound* notify, which we do want to have happen
> during transaction commit. The real problem with the current async.c is
> that it tries to combine outbound and inbound notify code. Keeping them

The real problem is that inbound notifiy can happen at any moment from the
signal handler.

> separate will be better.) This approach takes less processing, and it
> also means that we need not do anything so risky as
> RelationUnsetLockForWrite() to gain performance. The pg_listen lock
> will be held for only a very short time before we release it as a normal
> part of commit. Moreover, a self-notify need never modify the pg_listen
> table at all, saving cycles and opening the door for another feature
> (see below).

But in general you must still get the write lock to prevent deadlocks.

> If the transaction is aborted rather than committed, we need only
> discard the list of pending notify names. pg_listen is never touched
> in that case.

This sounds reasonable.

> 3. While I am at it, I would like to add a feature that I've been
> wanting for a while. It seems to me that the be_pid field provided in a
> PGnotify message ought not always be your own backend's PID --- what use
> is that? What it ought to be is the PID of the *signaling* backend, or
> at least of one of the signaling backends other than your own. I have
> a bunch of applications that all read and write the same tables and let
> each other know about updates by NOTIFY messages. When one of these
> apps gets a NOTIFY, it has to re-read the table *even if the notify was
> the one it just issued*, and the updates in question are the ones it
> just wrote out. This is a waste of cycles, but right now there is no
> choice because you cannot tell whether the NOTIFY you get is due
> (solely) to your own NOTIFY or (perhaps in part) to someone else's.

The problem is that notify has the same semantic of unix signals. You can
have more notifies for the same table from many backends but you only know
that at least one has happened.
What you propose should be some sort of message queuing between backends
in which each message is notified separately to the other backend In this
case it would also be useful to pass some arbitrary text to the listening
backend.

> What I'd like to do is to have cross-backend notifies write the
> originating backend's PID into the "notification" field, rather than
> just a "1". The receiving backend can then send that value to its
> frontend, as opposed to sending its own PID as it will do in the
> self-notify case. In this way the frontend can unambiguously
> distinguish its own NOTIFY bouncing back from a NOTIFY from another
> client. (If several other clients send NOTIFY at about the same time,
> you may get only one message with only one of their PIDs, since there
> is only room for one originating PID in pg_listen. But it will be kept
> distinct from any message resulting from a NOTIFY of your own.) Note
> that we already made the backend's own PID available to the frontend as
> part of the CANCEL-related changes, so that part of the info is there
> for free.
>
> A more radical solution would be to suppress self-notifies altogether,
> but that is a change in semantics that would likely break some existing
> apps. I will be satisfied if the frontend can tell the difference
> between a self-notify and an incoming notify.

I wouldn't change this.

> Any objections? If there's something crucial I've missed in all this,
> better let me know ASAP!
>
> I know that it's a tad late in the beta cycle to be making such significant
> changes, but my applications really need NOTIFY to work and work reliably.
> We have to have this *now*. I hope y'all will bear with me.
>
> regards, tom lane
>

Massimo Dal Zotto

+----------------------------------------------------------------------+
| Massimo Dal Zotto email: dz(at)cs(dot)unitn(dot)it |
| Via Marconi, 141 phone: ++39-461-534251 |
| 38057 Pergine Valsugana (TN) www: http://www.cs.unitn.it/~dz/ |
| Italy pgp: finger dz(at)tango(dot)cs(dot)unitn(dot)it |
+----------------------------------------------------------------------+

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Massimo Dal Zotto 1998-09-28 16:46:58 Re: [HACKERS] It sorta works, but I'm confused about locking
Previous Message Tom Lane 1998-09-28 14:28:26 Proper cleanup at backend exit