Quick Links

Re: Issue with the PRNG used by Postgres

From:	Parag Paul <parag(dot)paul(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Issue with the PRNG used by Postgres
Date:	2024-04-10 16:39:45
Message-ID:	CAA=PXp3jBDvx7HwOfeF8OFKZA7WD=ZDA+zdpTARnJaYWu2_2cw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

hi Tom,
Sorry for the delayed response. I was collecting of the data from my
production servers.

The reason why this could be a problem is a flaw in the RNG with the
enlarged Hamming belt.
I attached an image here, with the RNG outputs from 2 backends. I ran our
code for weeks, and collected ther
values generated by the RNG over many backends. The one in Green (say
backend id 600), stopped flapping values and
only produced low (near 0 ) values for half an hour, whereas the Blue(say
backend 700), kept generating good values and had
a range between [0-1)
During this period, the backed 600 suffered and ended up with spinlock
stuck condition.

Extensive analysis here -
https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
-Parag

On Wed, Apr 10, 2024 at 9:28 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Actually ... Parag mentioned that this was specifically about
> lwlock.c's usage of spinlocks. It doesn't really use a spinlock,
> but it does use s_lock.c's delay logic, and I think it's got the
> usage pattern wrong:
>
> while (true)
> {
> /* always try once to acquire lock directly */
> old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
> if (!(old_state & LW_FLAG_LOCKED))
> break; /* got lock */
>
> /* and then spin without atomic operations until lock is released
> */
> {
> SpinDelayStatus delayStatus;
>
> init_local_spin_delay(&delayStatus);
>
> while (old_state & LW_FLAG_LOCKED)
> {
> perform_spin_delay(&delayStatus);
> old_state = pg_atomic_read_u32(&lock->state);
> }
> #ifdef LWLOCK_STATS
> delays += delayStatus.delays;
> #endif
> finish_spin_delay(&delayStatus);
> }
>
> /*
> * Retry. The lock might obviously already be re-acquired by the
> time
> * we're attempting to get it again.
> */
> }
>
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop. That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.
>
> (One should still wonder what is the LWLock usage pattern that is
> causing this spot to become so heavily contended.)
>
> regards, tom lane
>

Attachment	Content-Type	Size
image001.png	image/png	108.2 KB

In response to

Re: Issue with the PRNG used by Postgres at 2024-04-10 16:28:10 from Tom Lane

Responses

Re: Issue with the PRNG used by Postgres at 2024-04-10 16:43:00 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Parag Paul	2024-04-10 16:42:03	Re: Issue with the PRNG used by Postgres
Previous Message	Robert Haas	2024-04-10 16:37:12	Re: Issue with the PRNG used by Postgres