Re: "ERROR: latch already owned" on gharial

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Sandeep Thakkar <sandeep(dot)thakkar(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, buildfarm-admins(at)lists(dot)postgresql(dot)org, CM Team <cm(at)enterprisedb(dot)com>
Subject: Re: "ERROR: latch already owned" on gharial
Date: 2024-02-08 21:41:14
Message-ID: 20240208214114.cpkib3tnfypjcjau@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-02-08 14:57:47 +0200, Heikki Linnakangas wrote:
> On 08/02/2024 04:08, Soumyadeep Chakraborty wrote:
> > A possible ordering of events:
> >
> > (1) DisownLatch() is called by pid Y during ProcKill() and the write for
> > latch->owner_pid = 0 is NOT yet flushed to shmem.
> >
> > (2) The PGPROC object for pid Y is returned to the free list.
> >
> > (3) Pid X sees the same PGPROC object on the free list and grabs it.
> >
> > (4) Pid X does sanity check inside OwnLatch during InitProcess and still
> > sees the old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.
> >
> > The above sequence of operations should apply to PG HEAD as well.
> >
> > Suggestion:
> >
> > Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
> > ResetLatch(), like the one introduced in [3]? This would ensure that the write
> > latch->owner_pid = 0; is flushed to shmem. The attached patch does this.
>
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen.
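
To make the ordering argument concrete, here is a minimal sketch of the
sequence above. It is a simplified model, not PostgreSQL source: ModelProc,
model_proc_kill() and model_init_process() are hypothetical stand-ins for
PGPROC, ProcKill() and InitProcess(), and a pthread mutex stands in for the
spinlock. The point it illustrates is that the slot only becomes visible on
the free list inside a lock acquire/release pair, so the earlier
owner_pid = 0 store should already be visible to whoever pops the slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PGPROC with its latch. */
typedef struct ModelProc
{
    int owner_pid;              /* models latch->owner_pid; 0 means unowned */
    struct ModelProc *next;     /* free-list link */
} ModelProc;

/* The mutex is an assumption standing in for the spinlock in ProcKill(). */
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static ModelProc *freelist = NULL;

/* Models OwnLatch(): returns false where the real code raises the ERROR. */
static bool
model_own_latch(ModelProc *proc, int pid)
{
    if (proc->owner_pid != 0)
        return false;           /* "latch already owned" */
    proc->owner_pid = pid;
    return true;
}

/* Models DisownLatch(): a plain store, with no barrier of its own. */
static void
model_disown_latch(ModelProc *proc, int pid)
{
    assert(proc->owner_pid == pid);
    proc->owner_pid = 0;
}

/* Models steps (1) and (2): disown, then return the slot under the lock. */
static void
model_proc_kill(ModelProc *proc, int pid)
{
    model_disown_latch(proc, pid);

    pthread_mutex_lock(&freelist_lock);
    proc->next = freelist;
    freelist = proc;
    /* Unlock has release semantics: owner_pid = 0 is published with the slot. */
    pthread_mutex_unlock(&freelist_lock);
}

/* Models steps (3) and (4): take a slot under the lock, then own its latch. */
static ModelProc *
model_init_process(int pid)
{
    ModelProc *proc;

    pthread_mutex_lock(&freelist_lock);
    proc = freelist;
    if (proc != NULL)
        freelist = proc->next;
    pthread_mutex_unlock(&freelist_lock);

    if (proc == NULL || !model_own_latch(proc, pid))
        return NULL;
    return proc;
}
```

In this model, a process that pops the slot after model_proc_kill() has
released the lock cannot observe the stale owner, which is why a missing
barrier in DisownLatch() alone does not look sufficient to explain the error.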

Right. I wonder if the issue instead could be something similar to what was
fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
through proc_exit() for the same process, you can get all kinds of weird
mixed-up resource ownership. The bug fixed in 8fb13dd6ab5b wouldn't apply,
but it's pretty easy to introduce similar bugs in other places, so it seems
quite plausible that Greenplum might have done so. We also did have more
proc_exit()s in signal handlers in older branches, so it might just be an
issue that also was present before.
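
As a purely hypothetical illustration of how a repeated exit path could
produce this symptom (this is not taken from either commit, and Slot,
exit_cleanup() and second_exit_trips_error() are made-up names): if the
cleanup for one process runs twice, the second pass can put a slot back on
the free list after another process has already taken it, and the next
process to pick it up then finds a nonzero owner.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot with an owner field, loosely modelling a proc's latch. */
typedef struct Slot
{
    int owner_pid;              /* 0 means unowned */
    struct Slot *next;
} Slot;

static Slot *free_slots = NULL;

static void
push_free(Slot *s)
{
    s->next = free_slots;
    free_slots = s;
}

static Slot *
pop_free(void)
{
    Slot *s = free_slots;

    if (s != NULL)
        free_slots = s->next;
    return s;
}

/*
 * Exit-time cleanup for one process. With guarded = false nothing stops it
 * from running twice, e.g. once normally and once from a signal handler.
 */
static void
exit_cleanup(Slot *my_slot, int my_pid, bool guarded, bool *exit_done)
{
    if (guarded && *exit_done)
        return;                 /* second pass is a no-op */
    *exit_done = true;

    if (my_slot->owner_pid == my_pid)
        my_slot->owner_pid = 0;
    push_free(my_slot);
}

/* Returns true if the third process would hit "latch already owned". */
static bool
second_exit_trips_error(bool guarded)
{
    static Slot slot;
    bool y_exit_done = false;
    Slot *s;

    free_slots = NULL;
    slot.owner_pid = 100;       /* pid Y owns the slot */
    slot.next = NULL;

    exit_cleanup(&slot, 100, guarded, &y_exit_done);    /* Y exits */

    s = pop_free();             /* pid X takes the slot and owns it */
    s->owner_pid = 200;

    exit_cleanup(&slot, 100, guarded, &y_exit_done);    /* Y's cleanup again */

    s = pop_free();             /* pid Z looks for a free slot */
    return s != NULL && s->owner_pid != 0;
}
```

In this toy model the unguarded variant hands pid Z a slot still owned by
pid X, while the guarded variant leaves the free list empty. Whether anything
like this happens in the failing build is speculation on top of the paragraph
above.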

Greetings,

Andres Freund
