Re: pg_basebackup blocking all queries with horrible performance

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Jerry Sievers <gsievers19(at)comcast(dot)net>, pgsql-admin(at)postgresql(dot)org
Subject: Re: pg_basebackup blocking all queries with horrible performance
Date: 2012-06-12 18:39:23
Message-ID: CABUevEzcJNNRHQNn=USd9McPShLuR4UT41ycKQJG6356ifti5A@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Tue, Jun 12, 2012 at 8:37 PM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
> On Tue, Jun 12, 2012 at 10:49 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Tue, Jun 12, 2012 at 2:37 AM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
>>> On Fri, Jun 8, 2012 at 7:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
>>>>> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
>>>>>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>>>>>
>>>>>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers <gsievers19(at)comcast(dot)net> wrote:
>>>>>>>>
>>>>>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>>>>>> if the problem goes away. Send SIGCONT and you should start seeing
>>>>>>>> the sluggishness again.
>>>>>>>>
>>>>>>>> If verified, then any sort of throttling mechanism should work.
>>>>>>>
>>>>>>>
>>>>>>> I'm certain that the problem is triggered only when pg_basebackup is
>>>>>>> running.  It's very predictable, and goes away as soon as pg_basebackup
>>>>>>> finishes running.  What do you mean by a throttling mechanism?
>>>>>>
>>>>>>
>>>>>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>>>>>> pg_basebackup, so it's still running but not currently doing work, does the
>>>>>> problem go away? Does it come back when you unpause pg_basebackup? That's
>>>>>> what Jerry was telling you to try.
>>>>>>
>>>>>> If the problem goes away when you pause pg_basebackup and comes back when
>>>>>> you unpause it, it's probably a system load problem.
>>>>>>
>>>>>> If it doesn't go away, it's more likely to be a locking issue or something
>>>>>> _other_ than simple load.
>>>>>>
>>>>>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>>>>>> it, so on Linux you can use these to try and find out. When you SIGSTOP
>>>>>> pg_basebackup then the postgres backend associated with it should block
>>>>>> shortly afterwards as its buffers fill up and it can't send more data, so
>>>>>> the load should come off the server.
>>>>>>
>>>>>> A "throttling mechanism" refers to anything that limits the rate or speed of
>>>>>> a thing. In this case, what you want to do if your problem is system
>>>>>> overload is to limit the speed at which pg_basebackup does its work so other
>>>>>> things can still get work done. In other words you want to throttle it.
>>>>>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>>>>>> change I/O and CPU priority, respectively.
>>>>>>
>>>>>> Note that you may need to change the priority of the *backend* that
>>>>>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>>>>>> I haven't done enough with Pg's replication to know how that works, so
>>>>>> someone else will have to fill that bit in.
>>>>>
>>>>> Thanks for your reply.  I've confirmed that issuing a SIGSTOP does
>>>>> eliminate the thrashing, and issuing a SIGCONT resumes the thrash.
>>>>>
>>>>> I've looked at iostat output both before & during pg_basebackup runs,
>>>>> and I'm not seeing any indication that the problem is due to disk IO
>>>>> bottlenecks.  The numbers don't vary very much at all between the good
>>>>> & bad times.  This is typical when pg_basebackup is running:
>>>>> ########
>>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>>> md0        0.00    0.00  67.76  68.62   4.42   1.46     88.34      0.00   0.00     0.00     0.00   0.00   0.00
>>>>> ########
>>>>>
>>>>> and this is when the system is ok:
>>>>> ########
>>>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>>>> md0        0.00    0.00  68.04  68.56   4.44   1.46     88.39      0.00   0.00     0.00     0.00   0.00   0.00
>>>>> ########
>>>>>
>>>>>
>>>>> I looked at vmstat output, but nothing is jumping out at me as being
>>>>> dramatically different when pg_basebackup is running.  swap in and
>>>>> swap out are zero 100% of the time for the good & bad perf cases.  I
>>>>> can post example output if someone is interested, or if there's
>>>>> something specific that I should be looking at as a potential problem,
>>>>> let me know.
>>>>
>>>> Did you set synchronous_standby_names to '*'? If so, the problem you
>>>> encountered can happen.
>>>>
>>>> When synchronous_standby_names is '*', you cannot control which
>>>> standby takes the role of synchronous standby. A standby that you
>>>> expect to run asynchronously might become the synchronous one. So
>>>> my guess is that at first one of your three standbys was running as
>>>> the synchronous standby, and all queries were executed normally. But
>>>> when you started pg_basebackup, pg_basebackup unexpectedly
>>>> took over the role of synchronous standby from that standby. Since
>>>> pg_basebackup doesn't send information about replication
>>>> progress back to the master, all queries (more precisely, transaction
>>>> commits) got stuck, waiting for a reply from the synchronous
>>>> standby.
>>>>
>>>> You can avoid this problem by setting synchronous_standby_names
>>>> to the names of your standbys instead of '*'.
>>>
>>> I don't have synchronous_standby_names set at all.  I'm only doing
>>> asynchronous replication.
>>
>> Hmm... For now I have no idea what happened in your environment.
>> Could you show me a self-contained test case?
>
> I'm running the following, which gets piped over ssh to a remote
> server (at gigabit ethernet speed):
> pg_basebackup -v -D - -x -Ft -U postgres
>
> One thing that I've discovered is that if I throttle back the speed of
> what is getting piped to the remote server, that directly correlates
> to the load on the server.

That seems to indicate that you're overloading the I/O system... Or
the CPU, but more likely I/O.
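
If throttling the pipe is what keeps the server usable, one way to make
that repeatable is to put a rate limiter between pg_basebackup and ssh.
The following is only a sketch: it assumes the separate "pv" utility is
installed (it is not part of PostgreSQL), and the host name, target path
and 20M rate in the usage comment are made-up placeholders to be tuned.

```shell
#!/bin/sh
# throttle RATE: copy stdin to stdout, capped at RATE bytes per second.
# Relies on pv's -L (rate limit) and -q (quiet) options; if pv is not
# installed, the data is passed through unthrottled with a warning.
throttle() {
    rate=$1
    if command -v pv >/dev/null 2>&1; then
        pv -q -L "$rate"
    else
        echo "pv not found; passing data through unthrottled" >&2
        cat
    fi
}

# Usage (hypothetical host and path):
#   pg_basebackup -v -D - -x -Ft -U postgres \
#       | throttle 20M \
#       | ssh backuphost 'cat > /backups/base.tar'
```

Lowering the rate until the other queries behave normally again should
also give a rough idea of how much headroom the I/O system really has.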

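For anyone who wants to repeat the SIGSTOP/SIGCONT check quoted above,
it can be scripted roughly like this. It is a sketch, not a tested
procedure for any particular setup; the function name is made up, and
the pgrep call in the usage comment assumes exactly one pg_basebackup
process is running on the machine.

```shell
#!/bin/sh
# pause_for PID SECONDS: pause a process with SIGSTOP, wait, then
# resume it with SIGCONT. Returns non-zero if the process cannot be
# signalled (for example because the PID does not exist).
pause_for() {
    pid=$1
    secs=$2
    kill -STOP "$pid" || return 1
    # While the process is paused, watch the server load from another
    # terminal (vmstat, iostat, top) to see whether it drops.
    sleep "$secs"
    kill -CONT "$pid"
}

# Usage (assumes a single pg_basebackup process):
#   pause_for "$(pgrep -x pg_basebackup)" 30
```
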
--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
