Re: the s_lock_stuck on perform_spin_delay

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <tmunro(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: the s_lock_stuck on perform_spin_delay
Date: 2024-01-08 19:57:34
Message-ID: CA+TgmoabLxQCbnatgdcO_VZKQtZqnj4c0zg3ovavqBZBh+QYcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 7, 2024 at 9:52 PM Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
> > I think we should add cassert-only infrastructure tracking whether we
> > currently hold spinlocks, are in a signal handler and perhaps a few other
> > states. That'd allow us to add assertions like:
> ..
> > - no lwlocks or ... while in signal handlers
>
> I *wish* lwlocks should *not* be held while in signal handlers since it
> inspired me for a direction of a low-frequency internal bug where a
> backend acuqire a LWLock when it has acuqired it before. However when I
> read more document and code, I am thinking this should not be a
> problem.

It's not safe to acquire an LWLock in a signal handler unless we know
that the code that the signal handler is interrupting can't already be
doing that. Consider this code from LWLockAcquire:

/* Add lock to list of locks held by this backend */
held_lwlocks[num_held_lwlocks].lock = lock;
held_lwlocks[num_held_lwlocks++].mode = mode;

Imagine that we're executing this code and we get to the point where
we've set held_lwlocks[num_held_lwlocks].lock = lock and
held_lwlock[num_held_lwlocks].mode = mode, but we haven't yet
incremented num_held_lwlocks. Then a signal arrives and we jump into
the signal handler, which also calls LWLockAcquire(). Hopefully you
can see that the new lock and mode will be written over the old one,
and now held_lwlocks[] is corrupted. Oops.

Because the PostgreSQL code relies extensively on global variables, it
has problems like this all over the place. Another example is the
error-reporting infrastructure. ereport(yadda yadda...) fills out a
bunch of global variables before really going and doing anything. If a
signal handler were to fire and start another ereport(...), chaos
would ensue, for similar reasons as here.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-01-08 20:00:01 Re: Removing unneeded self joins
Previous Message Dean Rasheed 2024-01-08 19:36:34 Re: Emitting JSON to file using COPY TO