Re: Bug: walsender and high CPU usage

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug: walsender and high CPU usage
Date: 2012-03-12 13:27:09
Message-ID: CAHGQGwG_mG2J7d-M09HkHKFV+XM1iXaTYDQbWhUPSEe89V_cNA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 09.03.2012 13:40, Fujii Masao wrote:
>>
>> I found the bug which causes walsender to enter into busy loop
>> when replication connection is terminated. Walsender consumes
>> lots of CPU resource (%sys), and this situation lasts until it has
>> detected the termination of replication connection and exited.
>>
>> The cause of this bug is that the walsender loop doesn't call
>> ResetLatch at all in the above case. Since the latch remains set,
>> the walsender loop cannot sleep on the latch, i.e., WaitLatch
>> always returns immediately.
>>
>> We can fix this bug by adding ResetLatch into the top of the
>> walsender loop. Patch attached.
>>
>> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has
>> already been fixed by the commit
>> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit
>> refactors and refines the walsender loop logic in addition to
>> adding ResetLatch. So I'm tempted to backport this commit
>> (except the deletion of wal_sender_delay) to 9.1 rather than
>> applying the attached patch. OTOH, attached patch is quite simple,
>> and its impact on 9.1 would be very small, so it's easy to backport that.
>> Thought?
>
>
> This patch makes the code that follows bogus:
>
>>                /*
>>                 * If we don't have any pending data in the output buffer,
>> try to send
>>                 * some more.
>>                 */
>>                if (!pq_is_send_pending())
>>                {
>>                        XLogSend(output_message, &caughtup);
>>
>>                        /*
>>                         * Even if we wrote all the WAL that was available
>> when we started
>>                         * sending, more might have arrived while we were
>> sending this
>>                         * batch. We had the latch set while sending, so we
>> have not
>>                         * received any signals from that time. Let's arm
>> the latch again,
>>                         * and after that check that we're still
>> up-to-date.
>>                         */
>>                        if (caughtup && !pq_is_send_pending())
>>                        {
>>                                ResetLatch(&MyWalSnd->latch);
>>
>>                                XLogSend(output_message, &caughtup);
>>                        }
>>                }
>
>
> The comment is no longer valid, and the calls to ResetLatch and XLogSend are
> no longer necessary, once you have the ResetLatch() call at the top of the
> loop.

Right.

> I also think we should backport commit
> cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of
> wal_sender_delay).

Agreed. The attached patch is the same as
cff75130b5f63e45423c2ed90d6f2e84c21ef840,
except for the removal of wal_sender_delay. Could you apply this?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-03-12 13:30:45 Re: Bug: walsender and high CPU usage
Previous Message Alexander Korotkov 2012-03-12 11:50:33 Re: Incorrect behaviour when using a GiST index on points