Re: 9.6rc1 Background worker starting multiple times

From: Keith Fiske <keith(at)omniti(dot)com>
To: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: 9.6rc1 Background worker starting multiple times
Date: 2016-12-23 23:21:04
Message-ID: CAG1_KcB7CWGNe_Mt4Q-my3X2Ou_bNEj=LUGB5fePzYHEuNZxdA@mail.gmail.com
Lists: pgsql-bugs

On Fri, Dec 23, 2016 at 5:47 PM, Keith Fiske <keith(at)omniti(dot)com> wrote:

>
> On Fri, Dec 23, 2016 at 4:18 PM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>
>>
>> On Tue, Dec 20, 2016 at 12:04 PM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>>
>>>
>>>
>>> On Tue, Dec 20, 2016 at 10:26 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Dec 20, 2016 at 1:42 AM, Michael Paquier <
>>>> michael(dot)paquier(at)gmail(dot)com> wrote:
>>>>
>>>>> On Tue, Dec 20, 2016 at 6:47 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>>>>> > I'm not quite sure exactly how to debug this further to see what is
>>>>> actually
>>>>> > setting the latch in 9.6.
>>>>>
>>>>> The commit you are complaining about here is likely this one:
>>>>> commit: db0f6cad4884bd4c835156d3a720d9a79dbd63a9
>>>>> author: Robert Haas <rhaas(at)postgresql(dot)org>
>>>>> date: Fri, 9 Oct 2015 14:31:04 -0400
>>>>> Remove set_latch_on_sigusr1 flag.
>>>>>
>>>>> This flag has proven to be a recipe for bugs, and it doesn't seem like
>>>>> it can really buy anything in terms of performance. So let's just
>>>>> *always* set the process latch when we receive SIGUSR1 instead of
>>>>> trying to do it only when needed.
>>>>>
>>>>> So your process gets a SIGUSR1 and its latch is set, causing the loop
>>>>> to repeat... I would discard that as a backend bug, because the
>>>>> removal of set_latch_on_sigusr1 was a good move.
>>>>> --
>>>>> Michael
>>>>>
>>>>
>>>> Anything I can do in the mean time to fix this, or have to wait for a
>>>> bugfix patch?
>>>>
>>>> Keith
>>>>
>>>
>>> So, attached is my naive attempt at handling SIGUSR1. I can at least
>>> confirm it is the source of the problem, but I'm unable to figure out how
>>> to work around it. Handling the signal does in fact stop the background
>>> worker from going nuts, but it also completely stops the maintenance loop
>>> from working properly. It runs once, then never again. It seems to get
>>> stuck after the call to WaitForBackgroundWorkerStartup() and never even
>>> gets to my WaitLatch() call. There is logging that it successfully
>>> unregisters the dynamic BGW, but it never seems to go further than that.
>>> I tried adding latch_sigusr1_handler() similar to how the handler in
>>> procsignal.c does it, but that didn't seem to fix it. With some additional
>>> debugging I added, it doesn't even appear to be calling my custom signal
>>> handler function at all; got_sigusr1 is never set to true.
>>>
>>> The simple act of registering a handler for SIGUSR1 seems to stop the
>>> original bug. If I comment out just this line in pg_partman_bgw_main()
>>> (the one in pg_partman_bgw_run_maint() seems to make no difference):
>>>
>>> pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);
>>>
>>> it goes back to doing what it was doing before.
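>>>
>>> In case it helps, the handler I tried looks roughly like this (a
>>> simplified sketch rather than the exact code in the attached patch; it
>>> follows the usual bgworker signal-handler pattern and mirrors what
>>> procsignal.c does):
>>>
>>> static volatile sig_atomic_t got_sigusr1 = false;
>>>
>>> static void
>>> pg_partman_bgw_sigusr1(SIGNAL_ARGS)
>>> {
>>>     int save_errno = errno;
>>>
>>>     got_sigusr1 = true;
>>>
>>>     /* let the latch machinery see the signal, as procsignal.c does */
>>>     latch_sigusr1_handler();
>>>
>>>     /* wake up anything waiting on our process latch */
>>>     SetLatch(MyLatch);
>>>
>>>     errno = save_errno;
>>> }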
>>>
>>> I'm guessing there's a lot more to SIGUSR1 than I can see right now, but
>>> I figured I'd at least try to provide some additional investigation. I
>>> noticed the BGW code mentions using SIGUSR1 on start/stop:
>>>
>>> https://doxygen.postgresql.org/bgworker_8h_source.html#l00099
>>>
>>> Perhaps that's part of the issue when I try to handle it myself?
>>>
>>> The partman maintenance timer is set to 10 seconds, so the timestamps on
>>> the log entries show it's not running on schedule.
>>>
>>> 2016-12-20 11:56:46 EST [] [14486]: [7-1] user=,db=,e=00000 LOG:
>>> database system is ready to accept connections
>>> 2016-12-20 11:56:46 EST [] [14495]: [1-1] user=,db=,e=00000 LOG:
>>> pg_partman master background worker master process initialized with role
>>> keith
>>> 2016-12-20 11:56:46 EST [] [14495]: [2-1] user=,db=,e=00000 DEBUG:
>>> Latch status just before for loop: 1
>>> 2016-12-20 11:56:46 EST [] [14495]: [3-1] user=,db=,e=00000 DEBUG:
>>> Latch status inside for loop before reset: 1
>>> 2016-12-20 11:56:46 EST [] [14495]: [4-1] user=,db=,e=00000 DEBUG:
>>> Latch status just after reset before interupt check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [5-1] user=,db=,e=00000 DEBUG:
>>> Latch status just after reset after interupt check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [6-1] user=,db=,e=00000 DEBUG:
>>> After sighup check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [7-1] user=,db=,e=00000 DEBUG:
>>> After sigterm check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [8-1] user=,db=,e=00000 DEBUG:
>>> Before sigusr1 check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [9-1] user=,db=,e=00000 DEBUG:
>>> After sigusr1 check: 0
>>> 2016-12-20 11:56:46 EST [] [14495]: [10-1] user=,db=,e=00000 DEBUG:
>>> Dynamic bgw launch begun for keith (0)
>>> 2016-12-20 11:56:46 EST [] [14495]: [11-1] user=,db=,e=00000 DEBUG:
>>> Registering dynamic background worker...
>>> 2016-12-20 11:56:46 EST [] [14495]: [12-1] user=,db=,e=00000 DEBUG:
>>> Waiting for BGW startup...
>>> 2016-12-20 11:56:46 EST [] [14486]: [8-1] user=,db=,e=00000 DEBUG:
>>> registering background worker "pg_partman dynamic background worker
>>> (dbname=keith)"
>>> 2016-12-20 11:56:46 EST [] [14486]: [9-1] user=,db=,e=00000 DEBUG:
>>> starting background worker process "pg_partman dynamic background worker
>>> (dbname=keith)"
>>> 2016-12-20 11:56:46 EST [] [14493]: [1-1] user=,db=,e=00000 LOG:
>>> autovacuum launcher started
>>> 2016-12-20 11:56:46 EST [] [14496]: [1-1] user=,db=,e=00000 DEBUG:
>>> Before parsing dbname GUC in dynamic main func: keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [2-1] user=,db=,e=00000 DEBUG: GUC
>>> rawstring copy: keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [3-1] user=,db=,e=00000 DEBUG:
>>> Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
>>> 2016-12-20 11:56:46 EST [] [14496]: [4-1] user=,db=,e=00000 DEBUG:
>>> Parsing list: keith (0)
>>> 2016-12-20 11:56:46 EST [] [14496]: [5-1] user=,db=,e=00000 DEBUG:
>>> Before run_maint initialize connection for db keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [6-1] user=,db=,e=00000 DEBUG:
>>> After run_maint initialize connection for db keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [7-1] user=,db=,e=00000 DEBUG:
>>> Checking if pg_partman extension is installed in database: keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [8-1] user=,db=,e=00000 LOG:
>>> pg_partman dynamic background worker (dbname=keith) dynamic background
>>> worker initialized with role keith on database keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [9-1] user=,db=,e=00000 LOG:
>>> pg_partman dynamic background worker (dbname=keith): SELECT
>>> "partman".run_maintenance(p_analyze := true, p_jobmon := true) called
>>> by role keith on database keith
>>> 2016-12-20 11:56:46 EST [] [14496]: [10-1] user=,db=,e=00000 DEBUG:
>>> pg_partman dynamic BGW shutting down gracefully for database keith.
>>> 2016-12-20 11:56:46 EST [] [14486]: [10-1] user=,db=,e=00000 DEBUG:
>>> worker process: pg_partman dynamic background worker (dbname=keith) (PID
>>> 14496) exited with exit code 0
>>> 2016-12-20 11:56:46 EST [] [14486]: [11-1] user=,db=,e=00000 DEBUG:
>>> unregistering background worker "pg_partman dynamic background worker
>>> (dbname=keith)"
>>> 2016-12-20 11:57:16 EST [] [14506]: [1-1] user=,db=,e=00000 DEBUG:
>>> autovacuum: processing database "keith"
>>> 2016-12-20 11:57:46 EST [] [14515]: [1-1] user=,db=,e=00000 DEBUG:
>>> autovacuum: processing database "postgres"
>>>
>>>
>> Found this pull request in pg_cron where they suggested using WaitLatch
>> similar to what I was doing in pg_partman.
>>
>> https://github.com/citusdata/pg_cron/pull/15
>>
>> I decided to apply that patch and see if it worked. After fixing the
>> missing WL_TIMEOUT flag, everything was working fine and pg_cron didn't
>> seem to have the same problem pg_partman does. So I went back to adding
>> some more debugging output, and my suspicion about the use of SIGUSR1 on
>> background worker startup/stop seems justified (thanks to whoever added
>> that comment).
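>>
>> For reference, the wait in that patch (with the WL_TIMEOUT flag added)
>> ends up looking roughly like the standard worker_spi loop. A sketch, with
>> my 10-second interval hard-coded for illustration:
>>
>> int rc;
>>
>> /* sleep until the timeout, the latch being set, or postmaster death */
>> rc = WaitLatch(MyLatch,
>>                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
>>                10 * 1000L);    /* maintenance interval, in ms */
>> ResetLatch(MyLatch);
>>
>> /* emergency bailout if postmaster has died */
>> if (rc & WL_POSTMASTER_DEATH)
>>     proc_exit(1);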
>>
>> I noticed the timing of things in the log (see below). The dynamic
>> background worker is unregistered while the main worker is sitting in
>> WaitLatch(), right before the wait returns. So perhaps it's the
>> unregistering of the dynamic BGW that the main BGW process had started
>> that's setting the latch...
>>
>> Looking through bgworker.c, I found this comment
>>
>> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/postmaster/bgworker.c#L257
>>
>> Sure enough, I use bgw_notify_pid so my main BGW process can know when
>> my dynamic worker finishes running. And if I'm reading this right, it
>> sends SIGUSR1 to do that notification:
>>
>> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/postmaster/bgworker.c#L276
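>>
>> The notification itself boils down to something like this (paraphrasing
>> what I see in bgworker.c, not a verbatim copy):
>>
>> /* postmaster reporting a worker's state change to whoever registered it */
>> if (rw->rw_worker.bgw_notify_pid != 0)
>>     kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);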
>>
>> So, now that SIGUSR1 always sets the latch in 9.6, it doesn't seem like
>> WaitLatch can be relied on for its timeout when a BGW itself starts other
>> dynamic background workers and has to rely on them informing it when
>> they're done running.
>>
>> Can anyone confirm/deny what I think I've found here? If this is true,
>> any idea how this can be handled now in 9.6 or is this a bug that needs to
>> be fixed?
>>
>> 2016-12-23 15:31:36 EST [] [30766]: [1-1] user=,db=,e=00000 LOG:
>> pg_partman master background worker master process initialized with role
>> keith
>> 2016-12-23 15:31:36 EST [] [30766]: [3-1] user=,db=,e=00000 DEBUG: Latch
>> status inside for loop before reset: 1
>> 2016-12-23 15:31:36 EST [] [30766]: [4-1] user=,db=,e=00000 DEBUG: Latch
>> status just after reset before interupt check: 0
>> [..]
>> 2016-12-23 15:31:36 EST [] [30766]: [14-1] user=,db=,e=00000 DEBUG:
>> Latch status just after BGW status check: 0
>> 2016-12-23 15:31:36 EST [] [30766]: [15-1] user=,db=,e=00000 DEBUG:
>> Latch status just before wait call: 0
>> [...]
>> 2016-12-23 15:31:36 EST [] [30758]: [11-1] user=,db=,e=00000 DEBUG:
>> unregistering background worker "pg_partman dynamic background worker
>> (dbname=keith)"
>> 2016-12-23 15:31:36 EST [] [30766]: [16-1] user=,db=,e=00000 DEBUG:
>> Latch status after wait call: 1
>> 2016-12-23 15:31:36 EST [] [30766]: [18-1] user=,db=,e=00000 DEBUG:
>> Latch status inside for loop before reset: 1
>> 2016-12-23 15:31:36 EST [] [30766]: [19-1] user=,db=,e=00000 DEBUG:
>> Latch status just after reset before interupt check: 0
>> [...]
>> 2016-12-23 15:31:36 EST [] [30766]: [29-1] user=,db=,e=00000 DEBUG:
>> Latch status just after BGW status check: 0
>> 2016-12-23 15:31:36 EST [] [30766]: [30-1] user=,db=,e=00000 DEBUG:
>> Latch status just before wait call: 0
>> [...]
>> 2016-12-23 15:31:36 EST [] [30758]: [15-1] user=,db=,e=00000 DEBUG:
>> unregistering background worker "pg_partman dynamic background worker
>> (dbname=keith)"
>> 2016-12-23 15:31:36 EST [] [30766]: [31-1] user=,db=,e=00000 DEBUG:
>> Latch status after wait call: 1
>> 2016-12-23 15:31:36 EST [] [30766]: [33-1] user=,db=,e=00000 DEBUG:
>> Latch status inside for loop before reset: 1
>> 2016-12-23 15:31:36 EST [] [30766]: [34-1] user=,db=,e=00000 DEBUG:
>> Latch status just after reset before interupt check: 0
>>
>>
> Just to help confirm my above assumption, commenting out this single line
> of code in my main background worker allowed the WaitLatch timeout to work
> again
>
> worker.bgw_notify_pid = MyProcPid;
>
> Thankfully the only thing my background worker does is call a plpgsql
> function on an interval, and that function takes an advisory lock to
> prevent concurrent runs. I don't think bgw_notify_pid was actually what
> was preventing concurrent runs before, but if someone actually needed that
> feedback and used WaitLatch, this could cause it to break. The only reason
> I had it in there was because I used the worker_spi code as my initial
> example of how to create a dynamic background worker.
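>
> For context, the registration in my main worker is patterned on worker_spi
> and looks roughly like this (a simplified sketch, not the exact pg_partman
> code; the names and flags here are illustrative):
>
> BackgroundWorker worker;
> BackgroundWorkerHandle *handle;
> pid_t pid;
>
> memset(&worker, 0, sizeof(worker));
> worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
>                    BGWORKER_BACKEND_DATABASE_CONNECTION;
> worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
> worker.bgw_restart_time = BGW_NEVER_RESTART;
> snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_partman_bgw");
> snprintf(worker.bgw_function_name, BGW_MAXLEN, "pg_partman_bgw_run_maint");
> snprintf(worker.bgw_name, BGW_MAXLEN, "pg_partman dynamic background worker");
>
> /* the line in question: ask the postmaster to SIGUSR1 this process */
> worker.bgw_notify_pid = MyProcPid;
>
> if (RegisterDynamicBackgroundWorker(&worker, &handle))
>     (void) WaitForBackgroundWorkerStartup(handle, &pid);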
>
> Keith
>

And I spoke too soon. Taking that line out does indeed stop the background
worker from going nuts, but it also stops it from working at all. It's
basically the same result as when I tried to handle SIGUSR1 myself: the main
background worker process halts as soon as the dynamic worker unregisters,
because it apparently needs to get that status back from the dynamic worker
in order to continue working.

So this is quite a big deal for anyone working with background workers that
must interact with each other and use the Latch system to allow proper
interruption of processes and clean shutdowns.

Keith
