Re: the s_lock_stuck on perform_spin_delay

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jim Nasby <jim(dot)nasby(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <tmunro(at)postgresql(dot)org>
Subject: Re: the s_lock_stuck on perform_spin_delay
Date: 2024-01-05 02:21:07
Message-ID: 20240105022107.huiznfsgwlbrsrd7@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-01-04 17:03:18 -0600, Jim Nasby wrote:
> On 1/4/24 10:33 AM, Tom Lane wrote:
>
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
> On Thu, Jan 4, 2024 at 10:22 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> We should be making an effort to ban coding patterns like
> "return with spinlock still held", because they're just too prone
> to errors similar to this one.
>
> I agree that we don't want to add overhead, and also about how
> spinlocks should be used, but I dispute "easily detectable
> statically." I mean, if you or I look at some code that uses a
> spinlock, we'll know whether the pattern that you mention is being
> followed or not, modulo differences of opinion in debatable cases. But
> you and I cannot be there to look at all the code all the time. If we
> had a static checking tool that was run as part of every build, or in
> the buildfarm, or by cfbot, or somewhere else that raised the alarm if
> this rule was violated, then we could claim to be effectively
> enforcing this rule.
>
> I was indeed suggesting that maybe we could find a way to detect
> such things automatically. While I've not been paying close
> attention, I recall there's been some discussions of using LLVM/clang
> infrastructure for customized static analysis, so maybe it'd be
> possible without an undue amount of effort.

I played around with this a while back. One reference with a link to a
playground to experiment with attributes:
https://www.postgresql.org/message-id/20200616233105.sm5bvodo6unigno7%40alap3.anarazel.de

Unfortunately clang's thread safety analysis doesn't handle conditionally
acquired locks, which made it far less promising than I initially thought.

I think there might be some other approaches, but they will all suffer from
not understanding "danger" encountered indirectly, via function calls doing
dangerous things. Which we would like to exclude, but I don't think that's
trivial either.

> FWIW, the lackey[1] tool in Valgrind is able to do some kinds of instruction
> counting, so it might be possible to measure how many instructions are actualyl
> being executed while holding a spinlock. Might be easier than code analysis.

I don't think that's particularly promising. Lackey is *slow*. And it requires
actually reaching problematic states. Consider e.g. the case that was reported
upthread, an lwlock acquired within a spinlock protected section - most of the
time that's not going to result in a lot of cycles, because the lwlock is
free.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2024-01-05 03:29:31 Re: Synchronizing slots from primary to standby
Previous Message Andy Fan 2024-01-05 02:20:39 Re: the s_lock_stuck on perform_spin_delay