Re: Issue with the PRNG used by Postgres

From: Parag Paul <parag(dot)paul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Issue with the PRNG used by Postgres
Date: 2024-04-10 16:48:42
Message-ID: CAA=PXp1fVKwZmJevL5o_j_i+zNAEjr=t+rHBK_O_=fw8bN7-cw@mail.gmail.com
Lists: pgsql-hackers

Yes, the probability of this happening is astronomically low, but in
production, on 128-core servers with max_connections set to 7000 and
petabyte-scale data, this reproduced twice in the last month. We had to
move to a local approach to manage our rate-limiting counters.
This is not easy to reproduce. I feel that we should at least shield
ourselves with the following change, so that the delay increases by at
least 1000us every time. In the worst case that gives only a linear
backoff, but that is better than no backoff.
    status->cur_delay += max(1000, (int) (status->cur_delay *
                                          pg_prng_double(&pg_global_prng_state) + 0.5));
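
To make the effect of the floor concrete, here is a minimal standalone sketch, not the actual PostgreSQL code. The helper names (next_delay_random, next_delay_floored, simulate) and the MIN_STEP_USEC constant are hypothetical, and the PRNG draw is passed in as a plain double so that the "stuck near 0" case can be forced. It assumes the current update is the same expression without the max().

```c
#define MIN_STEP_USEC 1000      /* floor proposed above (assumed unit: us) */

/* Assumed current behaviour: grow by a fraction r in [0, 1) of cur_delay. */
static int
next_delay_random(int cur_delay, double r)
{
    return cur_delay + (int) (cur_delay * r + 0.5);
}

/* Proposed behaviour: same growth, but never less than MIN_STEP_USEC. */
static int
next_delay_floored(int cur_delay, double r)
{
    int         step = (int) (cur_delay * r + 0.5);

    if (step < MIN_STEP_USEC)
        step = MIN_STEP_USEC;
    return cur_delay + step;
}

/* Apply one of the update rules n times with a fixed draw r. */
static int
simulate(int (*next) (int, double), int start, double r, int n)
{
    int         delay = start;

    for (int i = 0; i < n; i++)
        delay = next(delay, r);
    return delay;
}
```

With a draw pinned at 0.0, simulate(next_delay_random, 1000, 0.0, 5) never moves off 1000, while simulate(next_delay_floored, 1000, 0.0, 5) grows linearly to 6000. Once the random step exceeds the floor, both rules behave the same.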

On Wed, Apr 10, 2024 at 9:43 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Apr 10, 2024 at 12:40 PM Parag Paul <parag(dot)paul(at)gmail(dot)com> wrote:
> > The reason why this could be a problem is a flaw in the RNG with the
> > enlarged Hamming belt.
> > I attached an image here, with the RNG outputs from 2 backends. I ran
> > our code for weeks, and collected the values generated by the RNG
> > over many backends. The one in green (say backend id 600) stopped
> > flapping values and only produced low (near 0) values for half an
> > hour, whereas the blue one (say backend 700) kept generating good
> > values across the range [0, 1).
> > During this period, backend 600 suffered and ended up in a stuck
> > spinlock condition.
>
> This is a very vague description of a test procedure. If you provide a
> reproducible series of steps that causes a stuck spinlock, I imagine
> everyone will be on board with doing something about it. But this
> graph is not going to convince anyone of anything.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
