Re: several problems in pg_receivexlog

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: several problems in pg_receivexlog
Date: 2012-07-17 17:58:38
Message-ID: CAHGQGwGxW85ncRJQjk3-p8Z=D_q+dkHq4oJ4xCd4ki6Xx=Tq+Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 1:15 AM, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Thu, Jul 12, 2012 at 6:07 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Thu, Jul 12, 2012 at 8:39 PM, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>>> On Tue, Jul 10, 2012 at 7:03 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Tue, Jul 10, 2012 at 3:23 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>>> Hi,
>>>>>
>>>>> I found several problems in pg_receivexlog, e.g., memory leaks and
>>>>> file-descriptor leaks. The attached patch fixes these problems.
>>>>>
>>>>> ISTM there are still some other problems in pg_receivexlog, so I'll
>>>>> read it deeply later.
>>>>
>>>> While the pg_basebackup background process is streaming WAL records,
>>>> if its replication connection is terminated (e.g., the walsender in the
>>>> server is accidentally terminated by a SIGTERM signal), pg_basebackup ends
>>>> up failing to include all required WAL files in the backup. The problem
>>>> is that, in this case, pg_basebackup doesn't emit any error message at all.
>>>> So a user might mistakenly believe that a base backup has been successfully
>>>> taken even though it doesn't include all required WAL files.
>>>
>>> Ouch. That is definitely a bug if it behaves that way.
>>>
>>>
>>>> To fix this problem, I think that, when the replication connection is
>>>> terminated, ReceiveXlogStream() should check whether we've already
>>>> reached the stop point by calling stream_stop() before returning TRUE.
>>>> If we've not yet (this means that we've not received all required WAL
>>>> files yet), ReceiveXlogStream() should return FALSE and
>>>> pg_basebackup should emit an error message. Comments?
>>>
>>> Doesn't it already return false because it detects the error of the
>>> connection? What's the codepath where we end up returning true even
>>> though we had a connection failure? Shouldn't that end up under the
>>> "could not read copy data" branch, which already returns false?
>>
>> You're right. If the error is detected, that function always returns false
>> and the error message is emitted (but I think that the current error message
>> "pg_basebackup: child process exited with error 1" is confusing....),
>> so it's OK. But if walsender in the server is terminated by SIGTERM,
>> no error is detected and pg_basebackup background process gets out
>> of the loop in ReceiveXlogStream() and returns true.
>
> Oh. Because the server does a graceful shutdown. D'uh, of course.
>
> Then yes, your suggested fix seems like a good one.

Attached patch adds the fix.
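
For readers without the attachment, the idea can be sketched roughly as
below. This is only an illustration under simplifying assumptions, not the
attached patch: StreamEvent, receive_stream_sketch(), reached_stop_point()
and the numeric positions are all hypothetical stand-ins, and the real
stream_stop() callback takes different arguments. The point it shows is that
a clean end of the stream counts as success only if the stop point has
already been reached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events a streaming loop might see; not real libpq values. */
typedef enum
{
    EV_WAL_DATA,    /* a chunk of WAL arrived */
    EV_COPY_END,    /* server ended the stream gracefully (e.g. SIGTERM) */
    EV_READ_ERROR   /* "could not read copy data" style failure */
} StreamEvent;

/* Simplified stand-in for the stream_stop() callback (assumed signature). */
typedef bool (*stop_check_fn) (unsigned long received_pos);

/* Assumed stop position, chosen only for this example. */
static const unsigned long stop_pos = 300;

static bool
reached_stop_point(unsigned long received_pos)
{
    return received_pos >= stop_pos;
}

/*
 * Returns true only when streaming ended after the stop point was reached.
 * A graceful end of stream before that point is reported as failure.
 */
static bool
receive_stream_sketch(const StreamEvent *events, size_t nevents,
                      unsigned long bytes_per_event,
                      stop_check_fn stream_stop)
{
    unsigned long pos = 0;

    for (size_t i = 0; i < nevents; i++)
    {
        switch (events[i])
        {
            case EV_WAL_DATA:
                pos += bytes_per_event;
                if (stream_stop(pos))
                    return true;    /* everything required was received */
                break;
            case EV_READ_ERROR:
                return false;       /* error path already returned false */
            case EV_COPY_END:
                /* the check discussed above: clean end, but are we done? */
                return stream_stop(pos);
        }
    }

    /* Ran out of input: apply the same stop-point check. */
    return stream_stop(pos);
}
```

With this shape, the caller can emit an error whenever the function returns
false, including the case where the walsender shut down gracefully too early.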

Also, I found that I had forgotten to reset the file descriptor to -1 at the
end of ReceiveXlogStream() in my previously committed patch. The attached
patch fixes this problem as well.
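
Why the reset matters can be shown with a small sketch. This is not the code
in the patch: the variable walfile and the two helper functions are
hypothetical names used only for illustration. The assumption is that -1
serves as the "no file open" marker, so leaving the old value in place after
close() would make later code believe a file is still open.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical descriptor for the file being written; -1 means "none". */
static int walfile = -1;

/* Open a file and remember its descriptor (illustrative helper). */
static bool
open_walfile_sketch(const char *path)
{
    walfile = open(path, O_RDONLY);
    return walfile >= 0;
}

/*
 * Close the file and reset the descriptor, so that a second call, or any
 * later "is a file open?" test, does not act on a stale descriptor.
 */
static void
close_walfile_sketch(void)
{
    if (walfile >= 0)
    {
        close(walfile);
        walfile = -1;
    }
}
```

Calling close_walfile_sketch() twice is harmless here, because the second
call sees -1 and does nothing.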

Regards,

--
Fujii Masao

Attachment: pgreceivexlog_check_stoppoint_v1.patch (application/octet-stream, 1002 bytes)
