Re: 9.6rc1 Background worker starting multiple times

From: Keith Fiske <keith(at)omniti(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.6rc1 Background worker starting multiple times
Date: 2016-12-19 21:47:00
Message-ID: CAG1_KcA==oju1cYnN36_ZWZ4CbmkeM485qRVDPzogPYoA=HfjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Sep 27, 2016 at 11:31 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:

>
>
>
> On Thu, Sep 22, 2016 at 4:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>> On Thu, Sep 15, 2016 at 6:17 PM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>> > Running into an issue with pg_partman's (version 2.6.0) background
>> worker
>> > process with 9.6rc1.
>> >
>> > https://github.com/keithf4/pg_partman/tree/master/test/test_bgw
>> >
>> > Noticed the issue when a user ran the pgtap tests against it and I'm
>> seeing
>> > the same thing as well. It's working fine with 9.4 and 9.5. Looking in
>> the
>> > postgresql logs it seems to be trying to start the initial background
>> worker
>> > process multiple times and then one of them fails to start properly
>> which
>> > then causes issues when maintenance tries to run. I've attached the
>> > postgresql log file from when 9.5 starts up
>> > (95_postgresql-2016-09-15_180306.log) and things are working normally
>> and
>> > another log file from 9.6 (96_postgresql-2016-09-15_180131.log) that
>> shows
>> > the odd startup behavior
>>
>> I don't know what's happening here exactly, but I think your statement
>> that it is trying to start the same background worker multiple times
>> is not accurate. There are a series of log entries like this:
>>
>> 2016-09-15 18:03:06 EDT [] [7688]: [8-1] user=,db=,e=00000 DEBUG:
>> registering background worker "pg_partman dynamic background worker
>> (dbname=keith)"
>> 2016-09-15 18:03:06 EDT [] [7688]: [9-1] user=,db=,e=00000 DEBUG:
>> starting background worker process "pg_partman dynamic background
>> worker (dbname=keith)"
>> 2016-09-15 18:03:06 EDT [] [7694]: [1-1] user=,db=,e=00000 LOG:
>> autovacuum launcher started
>> 2016-09-15 18:03:06 EDT [] [7697]: [1-1] user=,db=,e=00000 DEBUG:
>> Before parsing dbname GUC in dynamic main func: keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [2-1] user=,db=,e=00000 DEBUG: GUC
>> rawstring copy: keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [3-1] user=,db=,e=00000 DEBUG:
>> Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
>> 2016-09-15 18:03:06 EDT [] [7697]: [4-1] user=,db=,e=00000 DEBUG:
>> Parsing list: keith (0)
>> 2016-09-15 18:03:06 EDT [] [7697]: [5-1] user=,db=,e=00000 DEBUG:
>> Before run_maint initialize connection for db keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [6-1] user=,db=,e=00000 DEBUG:
>> After run_maint initialize connection for db keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [7-1] user=,db=,e=00000 DEBUG:
>> Checking if pg_partman extension is installed in database: keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [8-1] user=,db=,e=00000 LOG:
>> pg_partman dynamic background worker (dbname=keith) dynamic background
>> worker initialized with role keith on database keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [9-1] user=,db=,e=00000 LOG:
>> pg_partman dynamic background worker (dbname=keith): SELECT
>> partman.run_maintenance(p_analyze := true, p_jobmon := true) called by
>> role keith on database keith
>> 2016-09-15 18:03:06 EDT [] [7697]: [10-1] user=,db=,e=00000 DEBUG:
>> pg_partman dynamic BGW shutting down gracefully for database keith.
>> 2016-09-15 18:03:06 EDT [] [7688]: [10-1] user=,db=,e=00000 DEBUG:
>> worker process: pg_partman dynamic background worker (dbname=keith)
>> (PID 7697) exited with exit code 0
>> 2016-09-15 18:03:06 EDT [] [7688]: [11-1] user=,db=,e=00000 DEBUG:
>> unregistering background worker "pg_partman dynamic background worker
>> (dbname=keith)"
>>
>> Note that the worker is registered (by something) at the beginning of
>> that sequence and then unregistered (because it exited with code 0) at
>> the end of that sequence. So the trouble isn't that the bgworker
>> system itself is restarting the worker, but that your code keeps
>> reregistering it. I can't guess why your code behaves differently on
>> 9.6 than on 9.5 and prior off the top of my head, but you might want
>> to try to figure that out. I don't rule out the possibility that
>> there's some bug in 9.6 that is tickling your code to behave
>> differently, but I don't think I want to try to go analyze your code
>> and figure out what the issue is either...
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
> I was mistaken about it starting multiple times being the actual problem.
> I thought the master process was trying to start more than once, but that's
> working normally. It's the dynamic portion that's starting multiple times,
> and that's normal, along with the unregistering of it. The master
> background worker process starts up a new dynamic background worker for
> each database configured to run pg_partman.
>
> So, someone forked pg_partman and submitted a patch that I thought may be
> relevant here. Didn't fix things, but may have made the issue clearer.
>
> https://github.com/keithf4/pg_partman/pull/140/
>
> The unclosed if statement was causing a "continue" to always be run which
> seemed to be causing the eventual conflict of advisory locks at some point.
> Removing the "continue" caused the conflict to no longer occur, but what
> now happens is that background worker loop executes extremely fast with no
> "sleep" state. It seems the latch process to cause the main background
> worker to wait the given number of seconds between creating the dynamic
> one(s) is not working. I didn't write this latch wait process myself but
> took it from the worker_spi example that comes with the pg source. I tested
> the 9.4 worker_spi I based it on in 9.6, though, and it seems to work fine.
> The worker_spi has changed a small bit of code in 9.5, but the change it
> incorporates doesn't seem to have been backpatched to 9.4, which I'm trying
> to stay compatible with.
>
> Right now, it seems the latch wait is no longer working. I'll continue
> looking myself and I've also asked the person that submitted that patch if
> he may be able to look as well. Appreciate your time!
>
> Keith
>

So trying to dig into this more myself, I added some debugging lines in the
attached copy of the bgw file to check the status of whether the latch is
getting set/reset properly. You don't even have to configure any partition
sets to see this happen, just load the library in postgresql.conf and set a
time, database & user with the partman GUCs. Below is from the log file
when running 9.6. You can see that something is setting the Latch when the
Wait is called causing it to immediately continue to the next loop with no
timeout.

2016-12-13 16:16:31 EST [] [6750]: [2-1] user=,db=,e=00000 DEBUG: Latch
status just before for loop: 1
2016-12-13 16:16:31 EST [] [6750]: [3-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-13 16:16:31 EST [] [6750]: [4-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [5-1] user=,db=,e=00000 DEBUG: Latch
status just after reset after interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [6-1] user=,db=,e=00000 DEBUG: Dynamic
bgw launch begun for keith (0)
2016-12-13 16:16:31 EST [] [6741]: [8-1] user=,db=,e=00000 DEBUG:
registering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6741]: [9-1] user=,db=,e=00000 DEBUG: starting
background worker process "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [7-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
2016-12-13 16:16:31 EST [] [6747]: [1-1] user=,db=,e=00000 LOG: autovacuum
launcher started
2016-12-13 16:16:31 EST [] [6751]: [1-1] user=,db=,e=00000 DEBUG: Before
parsing dbname GUC in dynamic main func: keith
2016-12-13 16:16:31 EST [] [6751]: [2-1] user=,db=,e=00000 DEBUG: GUC
rawstring copy: keith
2016-12-13 16:16:31 EST [] [6751]: [3-1] user=,db=,e=00000 DEBUG: Entered
foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:16:31 EST [] [6751]: [4-1] user=,db=,e=00000 DEBUG: Parsing
list: keith (0)
2016-12-13 16:16:31 EST [] [6751]: [5-1] user=,db=,e=00000 DEBUG: Before
run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6751]: [6-1] user=,db=,e=00000 DEBUG: After
run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6751]: [7-1] user=,db=,e=00000 DEBUG: Checking
if pg_partman extension is installed in database: keith
2016-12-13 16:16:31 EST [] [6751]: [8-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith) dynamic background worker
initialized with role keith on database keith
2016-12-13 16:16:31 EST [] [6751]: [9-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith): SELECT
"partman".run_maintenance(p_analyze := true, p_jobmon := true) called by
role keith on database keith
2016-12-13 16:16:31 EST [] [6751]: [10-1] user=,db=,e=00000 DEBUG:
pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:16:31 EST [] [6741]: [10-1] user=,db=,e=00000 DEBUG: worker
process: pg_partman dynamic background worker (dbname=keith) (PID 6751)
exited with exit code 0
2016-12-13 16:16:31 EST [] [6741]: [11-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [8-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 1
2016-12-13 16:16:31 EST [] [6750]: [9-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-13 16:16:31 EST [] [6750]: [10-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [11-1] user=,db=,e=00000 DEBUG: Latch
status just after reset after interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [12-1] user=,db=,e=00000 DEBUG: Dynamic
bgw launch begun for keith (0)
2016-12-13 16:16:31 EST [] [6741]: [12-1] user=,db=,e=00000 DEBUG:
registering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6741]: [13-1] user=,db=,e=00000 DEBUG:
starting background worker process "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6752]: [1-1] user=,db=,e=00000 DEBUG: Before
parsing dbname GUC in dynamic main func: keith
2016-12-13 16:16:31 EST [] [6752]: [2-1] user=,db=,e=00000 DEBUG: GUC
rawstring copy: keith
2016-12-13 16:16:31 EST [] [6752]: [3-1] user=,db=,e=00000 DEBUG: Entered
foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:16:31 EST [] [6752]: [4-1] user=,db=,e=00000 DEBUG: Parsing
list: keith (0)
2016-12-13 16:16:31 EST [] [6752]: [5-1] user=,db=,e=00000 DEBUG: Before
run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6752]: [6-1] user=,db=,e=00000 DEBUG: After
run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6752]: [7-1] user=,db=,e=00000 DEBUG: Checking
if pg_partman extension is installed in database: keith
2016-12-13 16:16:31 EST [] [6752]: [8-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith) dynamic background worker
initialized with role keith on database keith
2016-12-13 16:16:31 EST [] [6750]: [13-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
2016-12-13 16:16:31 EST [] [6752]: [9-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith): SELECT
"partman".run_maintenance(p_analyze := true, p_jobmon := true) called by
role keith on database keith
2016-12-13 16:16:31 EST [] [6752]: [10-1] user=,db=,e=00000 DEBUG:
pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:16:31 EST [] [6741]: [14-1] user=,db=,e=00000 DEBUG: worker
process: pg_partman dynamic background worker (dbname=keith) (PID 6752)
exited with exit code 0
2016-12-13 16:16:31 EST [] [6741]: [15-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [14-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 1
2016-12-13 16:16:31 EST [] [6750]: [15-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1

However, when running this in 9.5, you can see the latch is never set
during the Wait (except when the bgw is first started) and the WaitLatch()
is able to use the timeout properly (currently set for 10 seconds)

2016-12-13 16:22:58 EST [] [7743]: [2-1] user=,db=,e=00000 DEBUG: Latch
status just before for loop: 1
2016-12-13 16:22:58 EST [] [7743]: [3-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 1
2016-12-13 16:22:58 EST [] [7743]: [4-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
2016-12-13 16:22:58 EST [] [7743]: [5-1] user=,db=,e=00000 DEBUG: Latch
status just after reset after interupt check: 0
2016-12-13 16:22:58 EST [] [7743]: [6-1] user=,db=,e=00000 DEBUG: Dynamic
bgw launch begun for keith (0)
2016-12-13 16:22:58 EST [] [7735]: [8-1] user=,db=,e=00000 DEBUG:
registering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:22:58 EST [] [7735]: [9-1] user=,db=,e=00000 DEBUG: starting
background worker process "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:22:58 EST [] [7743]: [7-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
2016-12-13 16:22:58 EST [] [7741]: [1-1] user=,db=,e=00000 LOG: autovacuum
launcher started
2016-12-13 16:22:58 EST [] [7744]: [1-1] user=,db=,e=00000 DEBUG: Before
parsing dbname GUC in dynamic main func: keith
2016-12-13 16:22:58 EST [] [7744]: [2-1] user=,db=,e=00000 DEBUG: GUC
rawstring copy: keith
2016-12-13 16:22:58 EST [] [7744]: [3-1] user=,db=,e=00000 DEBUG: Entered
foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:22:58 EST [] [7744]: [4-1] user=,db=,e=00000 DEBUG: Parsing
list: keith (0)
2016-12-13 16:22:58 EST [] [7744]: [5-1] user=,db=,e=00000 DEBUG: Before
run_maint initialize connection for db keith
2016-12-13 16:22:58 EST [] [7744]: [6-1] user=,db=,e=00000 DEBUG: After
run_maint initialize connection for db keith
2016-12-13 16:22:58 EST [] [7744]: [7-1] user=,db=,e=00000 DEBUG: Checking
if pg_partman extension is installed in database: keith
2016-12-13 16:22:58 EST [] [7744]: [8-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith) dynamic background worker
initialized with role keith on database keith
2016-12-13 16:22:58 EST [] [7744]: [9-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith): SELECT
"partman".run_maintenance(p_analyze := true, p_jobmon := true) called by
role keith on database keith
2016-12-13 16:22:58 EST [] [7744]: [10-1] user=,db=,e=00000 DEBUG:
pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:22:58 EST [] [7735]: [10-1] user=,db=,e=00000 DEBUG: worker
process: pg_partman dynamic background worker (dbname=keith) (PID 7744)
exited with exit code 0
2016-12-13 16:22:58 EST [] [7735]: [11-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:23:08 EST [] [7743]: [8-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 0
2016-12-13 16:23:08 EST [] [7743]: [9-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 0
2016-12-13 16:23:08 EST [] [7743]: [10-1] user=,db=,e=00000 DEBUG: Latch
status just after reset before interupt check: 0
2016-12-13 16:23:08 EST [] [7743]: [11-1] user=,db=,e=00000 DEBUG: Latch
status just after reset after interupt check: 0
2016-12-13 16:23:08 EST [] [7743]: [12-1] user=,db=,e=00000 DEBUG: Dynamic
bgw launch begun for keith (0)
2016-12-13 16:23:08 EST [] [7735]: [12-1] user=,db=,e=00000 DEBUG:
registering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:23:08 EST [] [7735]: [13-1] user=,db=,e=00000 DEBUG:
starting background worker process "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:23:08 EST [] [7754]: [1-1] user=,db=,e=00000 DEBUG: Before
parsing dbname GUC in dynamic main func: keith
2016-12-13 16:23:08 EST [] [7754]: [2-1] user=,db=,e=00000 DEBUG: GUC
rawstring copy: keith
2016-12-13 16:23:08 EST [] [7754]: [3-1] user=,db=,e=00000 DEBUG: Entered
foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:23:08 EST [] [7754]: [4-1] user=,db=,e=00000 DEBUG: Parsing
list: keith (0)
2016-12-13 16:23:08 EST [] [7754]: [5-1] user=,db=,e=00000 DEBUG: Before
run_maint initialize connection for db keith
2016-12-13 16:23:08 EST [] [7754]: [6-1] user=,db=,e=00000 DEBUG: After
run_maint initialize connection for db keith
2016-12-13 16:23:08 EST [] [7754]: [7-1] user=,db=,e=00000 DEBUG: Checking
if pg_partman extension is installed in database: keith
2016-12-13 16:23:08 EST [] [7743]: [13-1] user=,db=,e=00000 DEBUG: Latch
status just before wait call: 0
2016-12-13 16:23:08 EST [] [7754]: [8-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith) dynamic background worker
initialized with role keith on database keith
2016-12-13 16:23:08 EST [] [7754]: [9-1] user=,db=,e=00000 LOG: pg_partman
dynamic background worker (dbname=keith): SELECT
"partman".run_maintenance(p_analyze := true, p_jobmon := true) called by
role keith on database keith
2016-12-13 16:23:08 EST [] [7754]: [10-1] user=,db=,e=00000 DEBUG:
pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:23:08 EST [] [7735]: [14-1] user=,db=,e=00000 DEBUG: worker
process: pg_partman dynamic background worker (dbname=keith) (PID 7754)
exited with exit code 0
2016-12-13 16:23:08 EST [] [7735]: [15-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-12-13 16:23:18 EST [] [7743]: [14-1] user=,db=,e=00000 DEBUG: Latch
status after wait call: 0
2016-12-13 16:23:18 EST [] [7743]: [15-1] user=,db=,e=00000 DEBUG: Latch
status inside for loop before reset: 0

I'm not quite sure exactly how to debug this further to see what is
actually setting the latch in 9.6.

Attachment Content-Type Size
pg_partman_bgw.c text/x-csrc 17.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2016-12-20 06:42:23 Re: 9.6rc1 Background worker starting multiple times
Previous Message Tom Lane 2016-12-19 21:22:24 Re: Crash with a CUBE query on 9.6