Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2012-10-18 16:48:31
Message-ID: CAHGQGwE73DhFYbgN_8rCz1mpmJwjg4vkTrEBzm=X0MGOjW=wjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Tue, Oct 16, 2012 at 9:31 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 15.10.2012 19:31, Fujii Masao wrote:
>>
>> On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> On 15.10.2012 13:13, Heikki Linnakangas wrote:
>>>>
>>>>
>>>> Oh, I didn't remember that we've documented the specific structs that we
>>>> pass around. It's quite bogus anyway to explain the messages the way we
>>>> do currently, as they are actually dependent on the underlying
>>>> architecture's endianess and padding. I think we should refactor the
>>>> protocol to not transmit raw structs, but use pq_sentint and friends to
>>>> construct the messages. This was discussed earlier (see
>>>>
>>>>
>>>> http://archives.postgresql.org/message-id/4FE2279C.2070506@enterprisedb.com),
>>>> I think there's consensus that 9.3 would be a good time to do that as we
>>>> changed the XLogRecPtr format anyway.
>>>
>>>
>>>
>>> This is what I came up with. The replication protocol is now
>>> architecture-independent. The WAL format itself is still
>>> architecture-independent, of course, but this is useful if you want to
>>> e.g
>>> use pg_receivexlog to back up a server that runs on a different platform.
>>>
>>> I chose the int64 format to transmit timestamps, even when compiled with
>>> --disable-integer-datetimes.
>>>
>>> Please review if you have the time..
>>
>>
>> Thanks for the patch!
>>
>> When I ran pg_receivexlog, I encountered the following error.
>
>
> Yeah, clearly I didn't test this near enough...
>
> I fixed the bugs you bumped into, new version attached.

Thanks for updating the patch!

We should remove the check of integer_datetime by pg_basebackup
background process and pg_receivexlog? Currently, they always check
it, and then if its setting value is not the same between a client and
server, they fail. Thanks to the patch, ISTM this check is no longer
required.

+ pq_sendint64(&reply_message, GetCurrentIntegerTimestamp());

In XLogWalRcvSendReply() and XLogWalRcvSendHSFeedback(),
GetCurrentTimestamp() is called twice. I think that we can skip the
latter call if integer-datetime is enabled because the return value of
GetCurrentTimestamp() and GetCurrentIntegerTimestamp() is in the
same format. It's worth reducing the number of GetCurrentTimestamp()
calls, I think.

elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
- (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
- (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
- (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
+ (uint32) (writePtr >> 32), (uint32) writePtr,
+ (uint32) (flushPtr >> 32), (uint32) flushPtr,
+ (uint32) (applyPtr >> 32), (uint32) applyPtr);

elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
- (uint32) (reply.write >> 32), (uint32) reply.write,
- (uint32) (reply.flush >> 32), (uint32) reply.flush,
- (uint32) (reply.apply >> 32), (uint32) reply.apply);
+ (uint32) (writePtr >> 32), (uint32) writePtr,
+ (uint32) (flushPtr >> 32), (uint32) flushPtr,
+ (uint32) (applyPtr >> 32), (uint32) applyPtr);

Isn't it worth logging not only WAL location but also the replyRequested
flag in these debug message?

The remaining of the patch looks good to me.

>> + hdrlen = sizeof(int64) + sizeof(int64) +
>> sizeof(int64);
>> + hdrlen = sizeof(int64) + sizeof(int64) +
>> sizeof(char);
>>
>> These should be macro, to avoid calculation overhead?
>
>
> The compiler will calculate this at compilation time, it's going to be a
> constant at runtime.

Yes, you're right.

Regards,

--
Fujii Masao

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2012-10-18 17:57:04 Re: BUG #7611: \copy (and COPY?) incorrectly parses nul character for windows-1252
Previous Message Maxim Boguk 2012-10-18 16:03:05 Re: BUG #7612: Wrong result with join between two values () set

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Farina 2012-10-18 16:56:05 Re: xlog filename formatting functions in recovery
Previous Message Tom Lane 2012-10-18 16:38:50 Re: Deprecations in authentication