Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: buschmann(at)nidsa(dot)net, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use
Date: 2019-02-20 02:21:05
Message-ID: CA+hUKGKpQJCWcgyy3QTC9vdn6uKAR_8r__A-MMm2GYfj45caag@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Feb 19, 2019 at 7:31 AM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 15641
> Logged by: Hans Buschmann
> Email address: buschmann(at)nidsa(dot)net
> PostgreSQL version: 11.2
> Operating system: Windows Server 2019 Standard
> Description:
>
> I recently moved a production system from PG 10.7 to 11.2 on a different
> Server.
>
> The configuration settings were mostly taken from the old system and
> extended with new PG 11 features.
>
> pg_prewarm was used for a long time (with no specific configuration).
>
> Now I have added Huge page support for Windows in the OS and verified it
> with vmmap tool from Sysinternals to be active.
> (the shared buffers are locked in memory: Lock_WS is set).
>
> When pg_prewarm.autoprewarm is set to on (using the default after initial
> database import via pg_restore), the autoprewarm worker process
> terminates immediately and generates a huge number of logfile entries
> like:
>
> CPS PRD 2019-02-17 16:11:53 CET 00000 11:> LOG: background worker
> "autoprewarm worker" (PID 3996) exited with exit code 1
> CPS PRD 2019-02-17 16:11:53 CET 55000 1:> ERROR: could not map dynamic
> shared memory segment
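
If I'm reading the report right, the setup amounts to roughly the
following (my paraphrase of the settings described above, not a copy of
the actual postgresql.conf; the shared_buffers value is a placeholder):

shared_preload_libraries = 'pg_prewarm'  # registers the autoprewarm workers at startup
pg_prewarm.autoprewarm = on              # the default once the library is loaded
huge_pages = on                          # needs the "Lock pages in memory" right on Windows
shared_buffers = 8GB                     # placeholder value, not taken from the report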

Hmm. It's not clear to me how using large pages for the main
PostgreSQL shared memory region could have any impact on autoprewarm's
entirely separate DSM segment. I wonder if other DSM use cases are
impacted. Does parallel query work? For example, the following
produces a parallel query that uses a few DSM segments:

create table foo as select generate_series(1, 1000000)::int i;  -- million-row test table
analyze foo;  -- gather stats so the planner chooses a parallel plan
explain analyze select count(*) from foo f1 join foo f2 using (i);  -- parallel join, allocates DSM segments

Looking at the place where that error occurs, it seems like it simply
failed to find the handle, as if it didn't exist at all at the time
dsm_attach() was called. I'm not entirely sure how that could happen
just because you turned on huge pages. Is it possible that there is a
race where apw_load_buffers() manages to detach before the worker
attaches, and turning on huge pages changes the timing? At a glance,
that shouldn't happen, because apw_start_database_worker() waits for
the worker to exit before returning.
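
For reference, the worker side of that hand-off boils down to roughly
the pattern below. This is a paraphrase from memory rather than the
exact autoprewarm.c code, and the function name is made up for
illustration: the leader passes the dsm_handle in bgw_main_arg, and the
worker tries to map it back with dsm_attach().

#include "postgres.h"

#include "postmaster/bgworker.h"
#include "storage/dsm.h"

/*
 * Hypothetical worker entry point, modelled on autoprewarm's
 * per-database worker.  main_arg carries the dsm_handle chosen by the
 * leader.  If the segment can't be found any more -- for instance
 * because the leader already detached and the segment was destroyed --
 * dsm_attach() returns NULL and we raise exactly the error shown in
 * the log above.
 */
PGDLLEXPORT void
prewarm_like_worker_main(Datum main_arg)
{
	dsm_segment *seg;

	BackgroundWorkerUnblockSignals();

	seg = dsm_attach(DatumGetUInt32(main_arg));
	if (seg == NULL)
		ereport(ERROR,
				(errmsg("could not map dynamic shared memory segment")));

	/* ... read the saved block list at dsm_segment_address(seg) ... */

	dsm_detach(seg);
}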

I think we'll need one of our Windows-enabled hackers to take a look.

PS Sorry for breaking the thread. I wish our archives app had a
"[re]send me this email" button, for people who subscribed after the
message was sent...

--
Thomas Munro
https://enterprisedb.com
