AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use

From: "Hans Buschmann" <buschmann(at)nidsa(dot)net>
To: "Mithun Cy" <mithun(dot)cy(at)gmail(dot)com>, "Mithun Cy" <mithun(dot)cy(at)enterprisedb(dot)com>, <thomas(dot)munro(at)gmail(dot)com>
Cc: <pgsql-bugs(at)lists(dot)postgresql(dot)org>, <robertmhaas(at)gmail(dot)com>
Subject: AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use
Date: 2019-02-24 14:04:09
Message-ID: D2B9F2A20670C84685EF7D183F2949E202569F21@gigant.nidsa.net
Lists: pgsql-bugs pgsql-hackers

On the weekend, I did some more investigations:

It seems that Huge pages are NOT the cause of this problem.

The problem is reproducible only ONCE; after a database restart it disappears.

After reinstalling the original pg_basebackup on another test VM, the problem reappeared once.

Here is the start of the error log:

CPS PRD 2019-02-24 12:11:57 CET 00000 1:> LOG: database system was interrupted; last known up at 2019-02-17 16:14:05 CET
CPS PRD 2019-02-24 12:12:16 CET 00000 2:> LOG: entering standby mode
CPS PRD 2019-02-24 12:12:16 CET 00000 3:> LOG: redo starts at 0/23000028
CPS PRD 2019-02-24 12:12:16 CET 00000 4:> LOG: consistent recovery state reached at 0/23000168
CPS PRD 2019-02-24 12:12:16 CET 00000 5:> LOG: invalid record length at 0/24000060: wanted 24, got 0
CPS PRD 2019-02-24 12:12:16 CET 00000 9:> LOG: database system is ready to accept read only connections
CPS PRD 2019-02-24 12:12:16 CET 3D000 1:> FATAL: database 16384 does not exist
CPS PRD 2019-02-24 12:12:16 CET 00000 10:> LOG: background worker "autoprewarm worker" (PID 3968) exited with exit code 1
CPS PRD 2019-02-24 12:12:16 CET 00000 1:> LOG: autoprewarm successfully prewarmed 0 of 12402 previously-loaded blocks
CPS PRD 2019-02-24 12:12:17 CET XX000 1:> FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.27.155", user "replicator", SSL off
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
CPS PRD 2019-02-24 12:12:17 CET 00000 11:> LOG: background worker "autoprewarm worker" (PID 3296) exited with exit code 1
CPS PRD 2019-02-24 12:12:17 CET XX000 1:> FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.27.155", user "replicator", SSL off
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
CPS PRD 2019-02-24 12:12:17 CET 00000 12:> LOG: background worker "autoprewarm worker" (PID 2756) exited with exit code 1
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
...
(PS: the correct replication function was not set, which causes the errors concerning replication.)

It seems that an outdated autoprewarm.blocks causes the problem.

After a restart the autoprewarm.blocks file seems to be rewritten, so that the next start gives no error.

As a test, I copied the erroneous autoprewarm.blocks file back into the data directory, and the problem reappeared.

The autoprewarm.blocks file is not corrupted or moved around manually but rather a leftover from the preceding test installation.

On this instance I had installed a copy of the production database under 11.2.
During the production switch, I dropped the test database and pg_restored the current one.

This left the previous autoprewarm.blocks file in the data directory.

On the first start, the autoprewarm file does not match the newly restored database (perhaps the cause of the fatal error: database 16384 does not exist).

So the problem lies in the initial detection of the autoprewarm.blocks file.
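
To illustrate how such a stale file could be spotted before startup, here is a minimal Python sketch. It assumes the autoprewarm.blocks layout used by pg_prewarm in PostgreSQL 11 (a header line `<<N>>` followed by N lines of `database,tablespace,filenode,forknum,blocknum`); that layout is an assumption taken from the extension's dump code, not from this report. The function names are hypothetical, and the set of existing database OIDs must be supplied by the caller.

```python
import re
from typing import Iterable, List, Set, Tuple

# (database OID, tablespace OID, filenode, fork number, block number)
BlockRecord = Tuple[int, int, int, int, int]


def parse_autoprewarm_blocks(lines: Iterable[str]) -> List[BlockRecord]:
    """Parse the text of an autoprewarm.blocks file (assumed layout)."""
    it = iter(lines)
    header = next(it, "").strip()
    match = re.fullmatch(r"<<(\d+)>>", header)
    if match is None:
        raise ValueError(f"unexpected header line: {header!r}")
    expected = int(match.group(1))

    records: List[BlockRecord] = []
    for line in it:
        line = line.strip()
        if not line:
            continue
        fields = line.split(",")
        if len(fields) != 5:
            raise ValueError(f"unexpected record line: {line!r}")
        db, ts, filenode, fork, block = (int(f) for f in fields)
        records.append((db, ts, filenode, fork, block))

    if len(records) != expected:
        raise ValueError(f"header announces {expected} blocks, found {len(records)}")
    return records


def stale_database_oids(records: Iterable[BlockRecord],
                        existing_oids: Set[int]) -> Set[int]:
    """Return database OIDs referenced in the file but absent from the cluster.

    Assumption: database OID 0 marks shared (global) relations and is skipped.
    """
    return {rec[0] for rec in records
            if rec[0] != 0 and rec[0] not in existing_oids}
```

The existing OIDs could be collected with `SELECT oid FROM pg_database` and passed in as a set; any OID the second function returns (16384 in the log above) would point to a leftover file.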

This seems easy to reproduce:

- Install/create a database with autoprewarm on and pg_prewarm loaded.
- Fill the autoprewarm cache with some data
- pg_dump the database
- drop the database
- create the database and pg_restore it from the dump
- start the instance and logs are flooded
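
One possible reading of these steps, written as a small Python helper that only builds the command lines and does not run them. The database name, table name, dump path and data directory are placeholders, and the helper name is hypothetical. It assumes the server already runs with `shared_preload_libraries = 'pg_prewarm'`, that the pg_prewarm extension has been created in the database, and that restarting via `pg_ctl` matches "start the instance"; whether this exact sequence triggers the error has not been verified here.

```python
from typing import List


def repro_commands(dbname: str, table: str,
                   dumpfile: str, datadir: str) -> List[List[str]]:
    """Build argv lists for the reproduction steps (nothing is executed)."""
    return [
        # Fill shared buffers so autoprewarm has blocks to record.
        ["psql", "-d", dbname, "-c", f"SELECT pg_prewarm('{table}')"],
        # Dump the database in custom format so pg_restore can read it.
        ["pg_dump", "-Fc", "-f", dumpfile, dbname],
        # Drop and recreate: the new database gets a different OID.
        ["dropdb", dbname],
        ["createdb", dbname],
        ["pg_restore", "-d", dbname, dumpfile],
        # Restart so the autoprewarm worker reads autoprewarm.blocks again.
        ["pg_ctl", "restart", "-D", datadir],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a throwaway test instance.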

I have not investigated the source code any further, due to my limited skills so far...

Thanks

Hans Buschmann
