Re: pg_basebackup blocking all queries with horrible performance

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Jerry Sievers <gsievers19(at)comcast(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-admin(at)postgresql(dot)org
Subject: Re: pg_basebackup blocking all queries with horrible performance
Date: 2012-06-09 02:29:36
Message-ID: CAHGQGwERsw_mmXcEktbkSC01cUs3-SXfQbNq5y5JDbMe8B=9RA@mail.gmail.com
Lists: pgsql-admin pgsql-hackers

On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:
> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:
>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>
>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers<gsievers19(at)comcast(dot)net>
>>>  wrote:
>>>>
>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>> whether the problem goes away.  Send SIGCONT and you should start
>>>> having sluggishness again.
>>>>
>>>> If verified, then any sort of throttling mechanism should work.
>>>
>>>
>>> I'm certain that the problem is triggered only when pg_basebackup is
>>> running.  It's very predictable, and goes away as soon as pg_basebackup
>>> finishes running.  What do you mean by a throttling mechanism?
>>
>>
>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>> pg_basebackup, so it's still running but not currently doing work, does the
>> problem go away? Does it come back when you unpause pg_basebackup? That's
>> what Jerry was telling you to try.
>>
>> If the problem goes away when you pause pg_basebackup and comes back when
>> you unpause it, it's probably a system load problem.
>>
>> If it doesn't go away, it's more likely to be a locking issue or something
>> _other_ than simple load.
>>
>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>> it, so on Linux you can use these to try and find out. When you SIGSTOP
>> pg_basebackup then the postgres backend associated with it should block
>> shortly afterwards as its buffers fill up and it can't send more data, so
>> the load should come off the server.
>>
>> A "throttling mechanism" refers to anything that limits the rate or speed of
>> a thing. In this case, what you want to do if your problem is system
>> overload is to limit the speed at which pg_basebackup does its work so other
>> things can still get work done. In other words you want to throttle it.
>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>> change I/O and CPU priority, respectively.
>>
>> Note that you may need to change the priority of the *backend* that
>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>> I haven't done enough with Pg's replication to know how that works, so
>> someone else will have to fill that bit in.
>
> Thanks for your reply.  I've confirmed that issuing a SIGSTOP does
> eliminate the thrashing, and issuing a SIGCONT resumes the thrash.
>
> I've looked at iostat output both before & during pg_basebackup runs,
> and I'm not seeing any indication that the problem is due to disk IO
> bottlenecks.  The numbers don't vary very much at all between the good
> & bad times.  This is typical when pg_basebackup is running:
> ########
> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> md0        0.00    0.00  67.76  68.62   4.42   1.46     88.34      0.00   0.00     0.00     0.00   0.00   0.00
> ########
>
> and this is when the system is ok:
> ########
> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> md0        0.00    0.00  68.04  68.56   4.44   1.46     88.39      0.00   0.00     0.00     0.00   0.00   0.00
> ########
>
>
> I looked at vmstat output, but nothing is jumping out at me as being
> dramatically different when pg_basebackup is running.  swap in and
> swap out are zero 100% of the time for the good & bad perf cases.  I
> can post example output if someone is interested, or if there's
> something specific that I should be looking at as a potential problem,
> let me know.

Did you set synchronous_standby_names to '*'? If so, that can cause
the problem you encountered.

When synchronous_standby_names is '*', you cannot control which
standby takes the role of synchronous standby. A standby that you
expect to run as asynchronous might become the synchronous one. So
my guess is that at first one of your three standbys was running as
the synchronous standby, and all queries were executed normally. But
when you started pg_basebackup, it unexpectedly took over the role
of synchronous standby from that standby. Since pg_basebackup
doesn't send information about replication progress back to the
master, all queries (more precisely, transaction commits) got stuck
and kept waiting for a reply from the synchronous standby.

You can avoid this problem by setting synchronous_standby_names
to the names of your standbys instead of '*'.
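
To illustrate the difference, here is a rough Python sketch of how the
synchronous standby might be chosen. It is a simplified model, not
PostgreSQL's actual code. It assumes that entries in
synchronous_standby_names are matched against each connection's
application_name, that '*' matches any name, and that the first
connected match wins. The names standby1..standby3, the connection
order, and the function pick_sync_standby are all hypothetical.

```python
from typing import Optional, Sequence


def pick_sync_standby(synchronous_standby_names: str,
                      connected: Sequence[str]) -> Optional[str]:
    """Return the application_name that would act as synchronous standby.

    Simplified model (an assumption, not the server's real logic):
    walk the configured names in order and return the first connected
    client that matches; '*' matches any application_name.
    """
    names = [n.strip() for n in synchronous_standby_names.split(",") if n.strip()]
    for name in names:
        for app in connected:
            if name == "*" or name == app:
                return app
    return None


# Hypothetical replication connections, listed in the order the model
# considers them. "pg_basebackup" stands for the backup client's connection.
connected = ["pg_basebackup", "standby1", "standby2", "standby3"]

# With '*', the backup connection can be picked, and commits would then
# wait on a client that never reports replication progress.
print(pick_sync_standby("*", connected))

# With explicit names, only a listed standby can be picked.
print(pick_sync_standby("standby1, standby2, standby3", connected))
```

In this model the first call picks pg_basebackup and the second picks
standby1, which is why listing the standby names explicitly avoids the
stall.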

This seems like a bug. I think we should prevent pg_basebackup from
becoming the synchronous standby. Thoughts?

Regards,

--
Fujii Masao
