the s_lock_stuck on perform_spin_delay

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Thomas Munro <tmunro(at)postgresql(dot)org>
Subject: the s_lock_stuck on perform_spin_delay
Date: 2024-01-04 06:59:06
Message-ID: 87plyhvama.fsf@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi,

from src/backend/storage/lmgr/README:

"""
Spinlocks. These are intended for *very* short-term locks. If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock. Spinlocks are primarily used as infrastructure for lightweight
locks.
"""

I totally agree with this and IIUC spin lock is usually used with the
following functions.

#define init_local_spin_delay(status) ..
void perform_spin_delay(SpinDelayStatus *status);
void finish_spin_delay(SpinDelayStatus *status);

During the perform_spin_delay, we have the following codes:

void
perform_spin_delay(SpinDelayStatus *status)

/* Block the process every spins_per_delay tries */
if (++(status->spins) >= spins_per_delay)
{
if (++(status->delays) > NUM_DELAYS)
s_lock_stuck(status->file, status->line, status->func);

the s_lock_stuck will PAINC the entire system.

My question is if someone doesn't obey the rule by mistake (everyone
can make mistake), shall we PANIC on a production environment? IMO I
think it can be a WARNING on a production environment and be a stuck
when 'ifdef USE_ASSERT_CHECKING'.

People may think spin lock may consume too much CPU, but it is not true
in the discussed scene since perform_spin_delay have pg_usleep in it,
and the MAX_DELAY_USEC is 1 second and MIN_DELAY_USEC is 0.001s.

I notice this issue actually because of the patch "Cache relation
sizes?" from Thomas Munro [1], where the latest patch[2] still have the
following code.
+ sr = smgr_alloc_sr(); <-- HERE a spin lock is hold
+
+ /* Upgrade to exclusive lock so we can create a mapping. */
+ LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
operation is needed. it may take a long time.

Our internal testing system found more misuses on our own PG version.

I think a experienced engineer like Thomas can make this mistake and the
patch was reviewed by 3 peoples, the bug is still there. It is not easy
to say just don't do it.

the attached code show the prototype in my mind. Any feedback is welcome.

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJg%2BgqCs0dgo94L%3D1J9pDp5hKkotji9A05k2nhYQhF4%2Bw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/attachment/123659/v5-0001-WIP-Track-relation-sizes-in-shared-memory.patch

--
Best Regards
Andy Fan

Attachment Content-Type Size
v1-0001-Don-t-call-s_lock_stuck-in-production-environment.patch text/x-diff 1.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2024-01-04 07:00:05 Re: speed up a logical replica setup
Previous Message Peter Smith 2024-01-04 06:53:44 Re: GUC names in messages