From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Thom Brown <thom(at)linux(dot)com> |
Cc: | Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Strange issues with 9.2 pg_basebackup & replication |
Date: | 2012-05-16 23:29:00 |
Message-ID: | CAHGQGwF86Mbp7z9Heuc93hWsrJ46JMvcLikDXc6O0adXG0V=+w@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, May 17, 2012 at 1:07 AM, Thom Brown <thom(at)linux(dot)com> wrote:
> On 16 May 2012 11:36, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Wed, May 16, 2012 at 2:29 AM, Thom Brown <thom(at)linux(dot)com> wrote:
>>> On 15 May 2012 13:15, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Wed, May 16, 2012 at 1:36 AM, Thom Brown <thom(at)linux(dot)com> wrote:
>>>>> However, this isn't true when I restart the standby. I've been
>>>>> informed that this should work fine if a WAL archive has been
>>>>> configured (which should be used anyway).
>>>>
>>>> The WAL archive should be shared by master-replica and replica-replica,
>>>> and recovery_target_timeline should be set to latest in replica-replica.
>>>> If you configure that way, replica-replica would successfully reconnect to
>>>> master-replica with no need to restart it.
>>>
>>> I had set the archive_command on the primary, then produced a base
>>> backup which would have copied the archive settings, but I also added
>>> a corresponding restore_command setting, so everything was pointing
>>> at the same archive.
>>
>> Hmm... when doing the same, the replica-replica successfully reconnected
>> to the master-replica after I shut down the master-master and promoted the
>> master-replica. archive_command is the same on all three servers,
>> restore_command is the same on both standby servers (i.e., master-replica
>> and replica-replica), and recovery_target_timeline is set to 'latest' on both
>> standby servers.
>
> I didn't shut down the master-master, but I didn't expect to need to.
>
> I also had recovery_target_timeline set to latest. I also tried
> explicitly setting it to the new timeline, and got an error saying
> there was no such timeline.
What did the replica-replica do after you got that error? Did it keep
repeating the error? Emit a PANIC and exit? Get stuck? Successfully
reconnect to the master-replica? ...
In theory, the timeline gap should be resolved as follows:
1. Promoting the master-replica terminates cascading replication.
2. While the replica-replica keeps trying to reconnect to the master-replica,
   it adjusts its timeline to the new one as soon as it finds a new timeline
   history file in the archive.
3. As a result of the promotion, the master-replica increments its timeline,
   creates the timeline history file and archives it.
4. Finally, the replica-replica finds the new timeline history file in the
   archive, adjusts its timeline to the new one, and successfully reconnects
   to the master-replica.
Note that, because of this timing issue, you might see the timeline mismatch
error a few times before replication successfully restarts.
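To make the sequence concrete, here is a toy Python model of the four steps above. Everything in it (CascadingStandby, simulate, archive_lag, follow_latest) is hypothetical and only mimics the observable behaviour described in this message: the cascading standby keeps retrying, rescans the shared archive for a newer timeline history file, and switches timelines only when recovery_target_timeline is 'latest'. It is a sketch under those assumptions, not how PostgreSQL's recovery code is actually written.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CascadingStandby:
    """Toy stand-in for replica-replica; NOT PostgreSQL's actual recovery code."""
    timeline: int = 1
    follow_latest: bool = True  # models recovery_target_timeline = 'latest'
    log: list[str] = field(default_factory=list)

    def rescan_archive(self, archived_histories: set[int]) -> None:
        # Steps 2 and 4: if a newer timeline history file is in the shared
        # archive, switch to the newest one (only when following 'latest').
        if not self.follow_latest:
            return
        newer = [tli for tli in archived_histories if tli > self.timeline]
        if newer:
            self.timeline = max(newer)

    def try_connect(self, upstream_timeline: int) -> bool:
        # Reconnection succeeds only when both sides are on the same timeline.
        if self.timeline != upstream_timeline:
            self.log.append(
                f"timeline mismatch: standby {self.timeline}, upstream {upstream_timeline}"
            )
            return False
        self.log.append(f"connected on timeline {self.timeline}")
        return True


def simulate(archive_lag: int, follow_latest: bool = True, max_attempts: int = 5) -> list[str]:
    """Promote the upstream, then let its history file reach the archive only
    after `archive_lag` reconnection attempts (the timing issue)."""
    standby = CascadingStandby(follow_latest=follow_latest)
    archived_histories: set[int] = set()

    # Steps 1 and 3: promotion ends cascading replication and bumps the
    # upstream from timeline 1 to timeline 2.
    upstream_timeline = 2

    for attempt in range(max_attempts):
        if attempt == archive_lag:
            # Step 3 (continued): the promoted server archives its history file.
            archived_histories.add(upstream_timeline)
        standby.rescan_archive(archived_histories)
        if standby.try_connect(upstream_timeline):
            break
    return standby.log
```

For example, `print("\n".join(simulate(archive_lag=2)))` shows two timeline mismatch errors followed by a successful reconnect, which is the transient error noted above; with `follow_latest=False` the standby never leaves timeline 1.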
>
>>> But in any case, shouldn't the replication connection be
>>> terminated when pg_basebackup is terminated?
>>
>> +1. To do this, we would need to define a SIGINT signal handler and make it
>> send a QueryCancel packet when Ctrl-C is typed.
>
> Also could we provide some feedback when using the -c spread option,
> when there isn't progress within a short period of time? Something
> like "Waiting for checkpoint. This can take up to
> %checkpoint_timeout%", or something similar, rather than seeing
> nothing happening and wondering if something has gone wrong.
+1, at least for the case where the -P option is specified in pg_basebackup.
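A rough Python sketch of the two pg_basebackup suggestions in this thread: cancel the server-side work on Ctrl-C, and print a notice while waiting for a spread checkpoint when progress reporting (-P) is on. The hooks checkpoint_done and send_cancel, the polling loop, and the notice_after threshold are all hypothetical; real pg_basebackup is written in C and blocks on the server's reply, and a C version would use libpq's cancel API (PQgetCancel/PQcancel), which libpq documents as safe to call from a signal handler.

```python
import signal
import sys
import time
from typing import Callable, TextIO


def wait_for_checkpoint(
    checkpoint_done: Callable[[], bool],
    send_cancel: Callable[[], None],
    *,
    show_progress: bool = True,   # assumed to correspond to -P
    notice_after: float = 5.0,    # hypothetical grace period in seconds
    poll_interval: float = 0.5,
    out: TextIO = sys.stderr,
) -> bool:
    """Poll until the (hypothetical) checkpoint_done hook reports completion.

    Returns True on completion, False if interrupted with Ctrl-C.
    """
    interrupted = False

    def on_sigint(signum, frame):
        nonlocal interrupted
        interrupted = True
        send_cancel()  # ask the server to abort instead of leaving it running

    previous = signal.signal(signal.SIGINT, on_sigint)
    start = time.monotonic()
    noticed = False
    try:
        while not checkpoint_done():
            if interrupted:
                return False
            if show_progress and not noticed and time.monotonic() - start >= notice_after:
                out.write("waiting for checkpoint; with -c spread this can take "
                          "up to checkpoint_timeout\n")
                noticed = True
            time.sleep(poll_interval)
        return not interrupted
    finally:
        # Restore whatever SIGINT handler was installed before.
        signal.signal(signal.SIGINT, previous)
```

Printing the notice only once, and only with -P, keeps quiet runs quiet while still telling an interactive user why nothing seems to be happening.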
Regards,
--
Fujii Masao