Re: loss of transactions in streaming replication

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: loss of transactions in streaming replication
Date: 2011-10-19 02:28:12
Message-ID: CA+TgmobdUdG-2D_=kpLwzpyoed9PH8+pHsubRjBUgCz_OaGwqQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 14, 2011 at 7:51 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Thu, Oct 13, 2011 at 10:08 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Wed, Oct 12, 2011 at 10:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Wed, Oct 12, 2011 at 5:45 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> In 9.2dev and 9.1, when walreceiver detects an error while sending data to
>>>> WAL stream, it always emits ERROR even if there are data available in the
>>>> receive buffer. This might lead to loss of transactions because such
>>>> remaining data are not received by walreceiver :(
>>>
>>> Won't it just reconnect?
>>
>> Yes if the master is running normally. OTOH, if the master is not running (i.e.,
>> failover case), the standby cannot receive again the data which it failed to
>> receive.
>>
>> I found this issue when I shut down the master. When the master shuts down,
>> it sends the shutdown checkpoint record, but I found that the standby failed
>> to receive it.
>
> Patch attached.
>
> The patch changes walreceiver so that it doesn't emit ERROR just yet even
> if it fails to send data to WAL stream. Then, after all available data have been
> received and flushed to the disk, it emits ERROR.
>
> If the patch is OK, it should be backported to v9.1.

Convince me. :-)

My reading of the situation is that you're talking about a problem
that will only occur if, while the master is in the process of
shutting down, a network error occurs. I am not sure it's a good idea
to complicate the code to handle that case, because (1) there are going
to be many similar situations where nothing within our power is
sufficient to prevent WAL from failing to make it to the standby and
(2) for this marginal improvement, you're giving up including
PQerrorMessage(streamConn) in the error message that ultimately gets
emitted, which seems like a substantial regression as far as
debuggability is concerned. Even if we do decide that we want the
change in behavior, I see no compelling reason to back-patch it.
Stable releases are supposed to be stable, not change behavior because
we thought of something we like better than what we originally
released.
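
To make the trade-off in point (2) concrete, here is a toy Python model of
the deferred-error behavior the patch description outlines. It is only a
sketch and not PostgreSQL code: ToyStream, send_reply, and
receive_then_fail are hypothetical names, and the error text merely stands
in for whatever PQerrorMessage(streamConn) would return. The sketch
captures the detail at the moment the send fails, drains what is already
buffered, and only then raises. It does not establish whether the actual
patch could do the same.

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a replication connection (not libpq)."""

    def __init__(self, buffered, error_text):
        # Records that already arrived and are waiting in the receive buffer.
        self.buffered = deque(buffered)
        # Stand-in for the text a PQerrorMessage-like call would return.
        self.error_text = error_text

    def send_reply(self):
        # Simulate the send failing because the peer has gone away.
        raise ConnectionError(self.error_text)


def receive_then_fail(stream, flushed):
    """Drain buffered records first, then raise with the original detail."""
    saved_error = None
    try:
        stream.send_reply()
    except ConnectionError as exc:
        # Capture the detail immediately, before any later call can change it.
        saved_error = str(exc)

    # Stand-in for writing and flushing the remaining data to disk.
    while stream.buffered:
        flushed.append(stream.buffered.popleft())

    if saved_error is not None:
        raise RuntimeError(f"could not send data to WAL stream: {saved_error}")
```

In this model the caller still sees a failure, but only after the buffered
records (for example, a shutdown checkpoint) have been appended to
`flushed`, and the raised message still carries the saved detail.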

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
