Re: Bug: walsender and high CPU usage

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug: walsender and high CPU usage
Date: 2012-03-12 10:58:51
Message-ID: 4F5DD6EB.5050309@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 09.03.2012 13:40, Fujii Masao wrote:
> I found the bug which causes walsender to enter into busy loop
> when replication connection is terminated. Walsender consumes
> lots of CPU resource (%sys), and this situation lasts until it has
> detected the termination of replication connection and exited.
>
> The cause of this bug is that the walsender loop doesn't call
> ResetLatch at all in the above case. Since the latch remains set,
> the walsender loop cannot sleep on the latch, i.e., WaitLatch
> always returns immediately.
>
> We can fix this bug by adding ResetLatch into the top of the
> walsender loop. Patch attached.
>
> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has
> already been fixed by the commit
> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit
> refactors and refines the walsender loop logic in addition to
> adding ResetLatch. So I'm tempted to backport this commit
> (except the deletion of wal_sender_delay) to 9.1 rather than
> applying the attached patch. OTOH, attached patch is quite simple,
> and its impact on 9.1 would be very small, so it's easy to backport that.
> Thought?

This patch makes the code that follows bogus:

> /*
> * If we don't have any pending data in the output buffer, try to send
> * some more.
> */
> if (!pq_is_send_pending())
> {
> XLogSend(output_message, &caughtup);
>
> /*
> * Even if we wrote all the WAL that was available when we started
> * sending, more might have arrived while we were sending this
> * batch. We had the latch set while sending, so we have not
> * received any signals from that time. Let's arm the latch again,
> * and after that check that we're still up-to-date.
> */
> if (caughtup && !pq_is_send_pending())
> {
> ResetLatch(&MyWalSnd->latch);
>
> XLogSend(output_message, &caughtup);
> }
> }

The comment is no longer valid, and the calls to ResetLatch and XLogSend
are no longer necessary, once you have the ResetLatch() call at the top
of the loop.

I also think we should backport commit
cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of
wal_sender_delay).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2012-03-12 11:50:33 Re: Incorrect behaviour when using a GiST index on points
Previous Message Etsuro Fujita 2012-03-12 10:21:06 NOT NULL violation error handling in file_fdw