Re: streaming replication breaks horribly if master crashes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication breaks horribly if master crashes
Date: 2010-06-16 20:56:53
Message-ID: 14597.1276721813@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The first problem I noticed is that the slave never seems to realize
> that the master has gone away. Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.

TCP timeout is the answer there.
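
As a rough illustration of what a TCP-level timeout looks like on the receiving side, the sketch below turns on keepalive probes for a client socket, so that a vanished peer is eventually detected rather than waited on forever. It is Python, not walreceiver's C; the helper name enable_keepalive and the timing values are assumptions picked for illustration, and the TCP_KEEP* options are platform-dependent, hence the guards.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on sock (hypothetical helper)."""
    # Portable switch: ask the kernel to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The timing knobs are not available on every platform, so set
    # only the ones this Python build and kernel expose.
    knobs = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in knobs:
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option known to Python but rejected by this kernel
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With the values assumed here, a dead peer would be noticed after roughly idle + interval * count seconds, about 110 seconds, on platforms that honor all three options.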

> More seriously, I was able to demonstrate that the problem linked in
> the thread above is real: if the master crashes after streaming WAL
> that it hasn't yet fsync'd, then on recovery the slave's xlog position
> is ahead of the master.

So indeed we'd better change walsender to not get ahead of the fsync'd
position. And probably also warn people to not disable fsync on the
master, unless they're willing to write it off and fail over at any
system crash.
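
The rule proposed above amounts to capping what gets streamed at the flushed position. A minimal sketch in Python, not PostgreSQL's actual walsender code: the function and parameter names are hypothetical, and WAL positions are modelled as plain integer byte offsets.

```python
def next_send_range(sent_upto, written_upto, flushed_upto):
    """Return the (start, end) WAL byte range that may be streamed next.

    Only WAL already fsync'd on the master is eligible, so the slave
    can never end up ahead of what the master would recover after a
    crash. Returns None when nothing new is safe to send.
    """
    # Never go past the flushed position, even if more has been written.
    limit = min(written_upto, flushed_upto)
    if limit <= sent_upto:
        return None
    return (sent_upto, limit)

# 500 bytes written but only 300 flushed: streaming stops at 300.
assert next_send_range(100, 500, 300) == (100, 300)
# Everything flushed so far has been sent: nothing to do yet.
assert next_send_range(300, 500, 300) is None
```

With fsync disabled on the master there is no trustworthy flushed position to cap against, which is why the warning above still applies.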

> I don't know what to do about this, but I'm pretty sure we can't ship it as-is.

Doesn't seem tremendously insoluble from here ...

regards, tom lane
