Re: the s_lock_stuck on perform_spin_delay

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <tmunro(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: the s_lock_stuck on perform_spin_delay
Date: 2024-01-10 13:03:28
Message-ID: CAEze2WggP-2Dhocmdhp-LxBzic=MXRgGA_tmv1G_9n-PDt2MQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 10 Jan 2024 at 02:44, Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
> Hi,
>
> I want to know if Andres or you have plan
> to do some code review. I don't expect this would happen very soon, just
> want to make sure this will not happen that both of you think the other
> one will do, but actually none of them does it in fact. a commit fest
> [1] has been added for this.

> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -5419,6 +5419,7 @@ LockBufHdr(BufferDesc *desc)
> perform_spin_delay(&delayStatus);
> }
> finish_spin_delay(&delayStatus);
> + START_SPIN_LOCK();
> return old_buf_state | BM_LOCKED;
> }

I think that we need to 'arm' the checks just before we lock the spin
lock, and 'disarm' the checks just after we unlock the spin lock,
rather than after and before, respectively. That way, we won't have a
chance of false negatives: with your current patch it is possible that
an interrupt fires between the acquisition of the lock and the code in
START_SPIN_LOCK() marking the thread as holding a spin lock, which
would cause any check in that signal handler to incorrectly read that
we don't hold any spin locks.

> +++ b/src/backend/storage/lmgr/lock.c
> @@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
> bool found_conflict;
> bool log_lock = false;
>
> + Assert(SpinLockCount == 0);
> +

I'm not 100% sure on the policy of this, but theoretically you could
use LockAquireExtended(dontWait=true) while holding a spin lock, as
that would not have an unknown duration. Then again, this function
also does elog/ereport, which would cause issues, still, so this code
may be the better option.

> + elog(PANIC, "stuck spinlock detected at %s, %s:%d after waiting for %u ms",
> + func, file, line, delay_ms);

pg_usleep doesn't actually guarantee that we'll wait for exactly that
duration; depending on signals received while spinning and/or OS
scheduling decisions it may be off by orders of magnitude.

> +++ b/src/common/scram-common.c

This is unrelated to the main patchset.

> +++ b/src/include/storage/spin.h

Minor: I think these changes could better be included in miscadmin, or
at least the definition for SpinLockCount should be moved there: The
spin lock system itself shouldn't be needed in places where we need to
make sure that we don't hold any spinlocks, and miscadmin.h already
holds things related to "System interrupt and critical section
handling", which seems quite related.

Kind regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-01-10 13:07:29 Re: A failure in t/038_save_logical_slots_shutdown.pl
Previous Message Nazir Bilal Yavuz 2024-01-10 12:59:24 Re: Show WAL write and fsync stats in pg_stat_io