From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: pg11.1: dsa_area could not attach to segment |
Date: | 2019-02-07 03:31:39 |
Message-ID: | CAEepm=20TBrkCZmK9Vi-5r-OAHdygAN0NqHn-uCb51hiZP+9rA@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Feb 7, 2019 at 12:47 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> However I *did* reproduce the error in an isolated, non-production postgres
> instance. It's a totally empty, untuned v11.1 initdb just for this, running ONLY
> a few simultaneous loops around just one query. It looks like the simultaneous
> loops sometimes (but not always) fail together. This has happened a couple
> times.
>
> It looks like one query failed due to "could not attach" in leader, one failed
> due to same in worker, and one failed with "not pinned", which I hadn't seen
> before and appears to be related to DSM, not DSA...
Hmm. I hadn't considered that angle... Some kind of interference
between unrelated DSA areas, or other DSM activity? I will also try
to repro that here...
> I'm also trying to reproduce on other production servers. But so far nothing
> else has shown the bug, including the other server which hit our original
> (other) DSA error with the queued_alters query. So I tentatively think there
> really may be something specific to the server (not the hypervisor so maybe the
> OS, libraries, kernel, scheduler, ??).
Initially I thought these might be two symptoms of the same corruption,
but I'm now starting to wonder if there are two bugs here: "could not
allocate %d pages" (rare) might be a logic bug in the computation of
contiguous_pages that requires a particular allocation pattern to hit,
and "dsa_area could not attach to segment" (rarissimo) might be
something else requiring concurrency/a race.
One thing that might be useful would be to add a call to
dsa_dump(area) just before the errors are raised, which will write the
area's state out to stderr and might give us some clues. It would also
help to print out the variable "index" from get_segment_by_index() when
it fails. I'm also going to try to work up some better assertions.
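
To make the idea concrete, here is a rough sketch of that instrumentation
pattern. It is a toy stand-in, not the real dsa.c code: every name
prefixed with model_ is hypothetical, and the real function raises an
error via elog() rather than returning NULL. In dsa.c itself the
equivalent change would presumably be a dsa_dump(area) call plus the
index added to the existing error message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only, NOT PostgreSQL's types. */
#define MODEL_MAX_SEGMENTS 8

typedef struct model_area
{
	void	   *segment_maps[MODEL_MAX_SEGMENTS];	/* NULL = not attached */
	unsigned	handles[MODEL_MAX_SEGMENTS];
} model_area;

/* Rough analogue of dsa_dump(): write the area's state to stderr. */
static void
model_dump(const model_area *area)
{
	assert(area != NULL);
	fprintf(stderr, "model_area dump:\n");
	for (size_t i = 0; i < MODEL_MAX_SEGMENTS; i++)
		fprintf(stderr, "  segment %zu: handle=%u mapped=%s\n",
				i, area->handles[i],
				area->segment_maps[i] != NULL ? "yes" : "no");
}

/*
 * Rough analogue of get_segment_by_index(): on failure, dump the whole
 * area first, then report which index could not be attached.
 */
static void *
model_get_segment_by_index(const model_area *area, size_t index)
{
	assert(area != NULL);
	if (index >= MODEL_MAX_SEGMENTS || area->segment_maps[index] == NULL)
	{
		model_dump(area);
		fprintf(stderr, "could not attach to segment (index %zu)\n", index);
		return NULL;
	}
	return area->segment_maps[index];
}
```

Calling model_get_segment_by_index() with an unmapped index writes the
dump followed by the failing index to stderr, which is the extra context
we'd want in the server log when the real error fires.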
--
Thomas Munro
http://www.enterprisedb.com