Re: Issue with the PRNG used by Postgres

From: Parag Paul <parag(dot)paul(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Issue with the PRNG used by Postgres
Date: 2024-04-10 16:39:45
Message-ID: CAA=PXp3jBDvx7HwOfeF8OFKZA7WD=ZDA+zdpTARnJaYWu2_2cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

hi Tom,
Sorry for the delayed response. I was collecting of the data from my
production servers.

The reason why this could be a problem is a flaw in the RNG with the
enlarged Hamming belt.
I attached an image here, with the RNG outputs from 2 backends. I ran our
code for weeks, and collected ther
values generated by the RNG over many backends. The one in Green (say
backend id 600), stopped flapping values and
only produced low (near 0 ) values for half an hour, whereas the Blue(say
backend 700), kept generating good values and had
a range between [0-1)
During this period, the backed 600 suffered and ended up with spinlock
stuck condition.

Extensive analysis here -
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
-Parag

On Wed, Apr 10, 2024 at 9:28 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Actually ... Parag mentioned that this was specifically about
> lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
> but it does use s_lock.c's delay logic, and I think it's got the
> usage pattern wrong:
>
> while (true)
> {
> /* always try once to acquire lock directly */
> old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
> if (!(old_state & LW_FLAG_LOCKED))
> break; /* got lock */
>
> /* and then spin without atomic operations until lock is released
> */
> {
> SpinDelayStatus delayStatus;
>
> init_local_spin_delay(&delayStatus);
>
> while (old_state & LW_FLAG_LOCKED)
> {
> perform_spin_delay(&delayStatus);
> old_state = pg_atomic_read_u32(&lock->state);
> }
> #ifdef LWLOCK_STATS
> delays += delayStatus.delays;
> #endif
> finish_spin_delay(&delayStatus);
> }
>
> /*
> * Retry. The lock might obviously already be re-acquired by the
> time
> * we're attempting to get it again.
> */
> }
>
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop. That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.
>
> (One should still wonder what is the LWLock usage pattern that is
> causing this spot to become so heavily contended.)
>
> regards, tom lane
>

Attachment Content-Type Size
image001.png image/png 108.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Parag Paul 2024-04-10 16:42:03 Re: Issue with the PRNG used by Postgres
Previous Message Robert Haas 2024-04-10 16:37:12 Re: Issue with the PRNG used by Postgres