Re: pg11.1: dsa_area could not attach to segment

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg11.1: dsa_area could not attach to segment
Date: 2019-02-07 03:31:39
Message-ID: CAEepm=20TBrkCZmK9Vi-5r-OAHdygAN0NqHn-uCb51hiZP+9rA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 7, 2019 at 12:47 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> However I *did* reproduce the error in an isolated, non-production postgres
> instance. It's a totally empty, untuned v11.1 initdb just for this, running ONLY
> a few simultaneous loops around just one query. It looks like the simultaneous
> loops sometimes (but not always) fail together. This has happened a couple
> times.
>
> It looks like one query failed due to "could not attach" in leader, one failed
> due to same in worker, and one failed with "not pinned", which I hadn't seen
> before and appears to be related to DSM, not DSA...

Hmm. I hadn't considered that angle... Some kind of interference
between unrelated DSA areas, or other DSM activity? I will also try
to repro that here...

> I'm also trying to reproduce on other production servers. But so far nothing
> else has shown the bug, including the other server which hit our original
> (other) DSA error with the queued_alters query. So I tentatively think there
> really may be something specific to the server (not the hypervisor so maybe the
> OS, libraries, kernel, scheduler, ??).

Initially I thought these might be two symptoms of the same corruption,
but I'm now starting to wonder if there are two bugs here: "could not
allocate %d pages" (rare) might be a logic bug in the computation of
contiguous_pages that requires a particular allocation pattern to hit,
and "dsa_area could not attach to segment" (rarissimo) might be
something else requiring concurrency/a race.

One thing that might be useful would be to add a call to
dsa_dump(area) just before the errors are raised, which will write the
area's internal state out to stderr and might give us some clues. It
would also help to print out the variable "index" from
get_segment_by_index() when it fails. I'm also going to try to work up
some better assertions.
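
To make the idea concrete, here is a rough standalone sketch of that
instrumentation pattern: dump the whole area state and report the failing
index just before the error path is taken. Everything named toy_* below is
a hypothetical stand-in invented for illustration, not PostgreSQL's code;
in the real tree the equivalent change would presumably be a call to
dsa_dump(area) plus the value of "index" in the error message inside
get_segment_by_index().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy model; these are NOT the real dsa.c structures. */
#define TOY_MAX_SEGMENTS 4

typedef struct toy_area
{
    unsigned int handles[TOY_MAX_SEGMENTS];    /* 0 means the slot is unused */
    int          attachable[TOY_MAX_SEGMENTS]; /* 0 simulates a failed attach */
} toy_area;

/* Stand-in for dsa_dump(area): write every slot of the table to stderr. */
static void
toy_dump(const toy_area *area)
{
    for (size_t i = 0; i < TOY_MAX_SEGMENTS; i++)
        fprintf(stderr, "segment %zu: handle %u, attachable %d\n",
                i, area->handles[i], area->attachable[i]);
}

/*
 * Stand-in for the failure path in get_segment_by_index().
 * Returns 0 on success and -1 on failure.  On failure it dumps the
 * area first and then reports which index could not be attached.
 */
static int
toy_get_segment_by_index(const toy_area *area, size_t index)
{
    assert(area != NULL);
    assert(index < TOY_MAX_SEGMENTS);

    if (area->handles[index] == 0 || !area->attachable[index])
    {
        toy_dump(area);     /* full state first ... */
        fprintf(stderr, "could not attach to segment, index = %zu\n", index);
        return -1;          /* ... then fail as before */
    }
    return 0;
}
```

Dumping before the error is raised matters because the state is captured at
the moment of failure, while the failing process still has it in view.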
--
Thomas Munro
http://www.enterprisedb.com
