Re: Synch Rep for CommitFest 2009-07

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 09:00:07
Message-ID: 4A5EEC17.6010903@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Thu, Jul 16, 2009 at 6:03 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> 1. Change the way synchronization is done when standby connects to
>> primary. After authentication, standby should send a message to primary,
>> stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
>> segment name). Primary starts streaming WAL starting from that point,
>> and keeps streaming forever. pg_read_xlogfile() needs to be removed.
>
> I assume that <begin> should indicate the location of the last valid record.
> In other words, at first the standby tries to recover by using only the XLOG
> files which exist in its archive or pg_xlog. When it has reached the last valid
> record, it requests the XLOG records which follow <begin> from the primary.
> Is my understanding OK?

Yes.

> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
> As I described before, the XLOG file which the standby creates should be
> recoverable. So, when <begin> indicates the middle of the XLOG file, the
> primary should start sending the records from the head of the file including
> <begin>. Is this OK?
>
> Or, the primary should start from <begin>? In this case, since we can
> expect that the incomplete file including <begin> would also exist in the
> standby, the records following <begin> need to be appended into it.

I would expect the standby to append to the partial XLOG file.
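As a rough illustration of what "append to the partial XLOG file" means, the sketch below maps a <begin> position to the segment file and byte offset where the standby would continue writing. The two-part pointer (xlogid, xrecoff), the 16 MB segment size, and the 24-hex-digit file naming are assumptions about the WAL layout, and all function names are hypothetical, not PostgreSQL APIs.

```python
from dataclasses import dataclass

# Assumption: default WAL segment size of 16 MB.
XLOG_SEG_SIZE = 16 * 1024 * 1024


@dataclass(frozen=True)
class XLogRecPtr:
    """Simplified WAL position: log file id plus byte offset within it."""
    xlogid: int
    xrecoff: int


def segment_and_offset(ptr):
    """Return (segment number within the log id, byte offset in that segment)."""
    return divmod(ptr.xrecoff, XLOG_SEG_SIZE)


def segment_file_name(timeline, ptr):
    """Build the assumed file name: timeline, log id, segment, 8 hex digits each."""
    seg, _ = segment_and_offset(ptr)
    return "%08X%08X%08X" % (timeline, ptr.xlogid, seg)


def append_plan(timeline, begin):
    """Describe where the standby would continue writing streamed WAL."""
    _, offset = segment_and_offset(begin)
    return {
        "file": segment_file_name(timeline, begin),
        "write_offset": offset,
        "bytes_left_in_segment": XLOG_SEG_SIZE - offset,
    }


if __name__ == "__main__":
    # <begin> falls 64 bytes into the fourth segment of log id 0.
    print(append_plan(1, XLogRecPtr(xlogid=0, xrecoff=0x03000040)))
```

With this arrangement the primary never resends the head of the file: the standby already holds the bytes before write_offset and only the tail is streamed.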

> And, if that incomplete file is the restored one from archive, it would need
> to be renamed from a temporary name before being appended.

The archive should not normally contain partial XLOG files. One would
end up there only if you manually copied it after the primary crashed.
So I don't think that's something we need to support.

> A timeline/backup history file is also required for recovery, but it's not
> found in the standby. So, they need to be shipped from the primary, and
> this capability is provided by pg_read_xlogfile(). If we remove the function,
> how should we transfer those history files? Is a function similar to
> pg_read_xlogfile(), to which the filename is passed, still
> necessary?

Hmm. You only need the timeline history file if the base backup was
taken in an earlier timeline. That situation would only arise if you
(manually) take a base backup, restore to a server (which creates a new
timeline), and then create a slave against that server. At least in the
1st phase, I think we can assume that the standby has access to the same
archive, and will find the history file there. If not, throw an
error. We can add more bells and whistles later.

> CHECKPOINT should not recycle the XLOG files following the file which
> is requested by the standby at that moment. So, we need to tweak the
> recycling policy.

Yep.
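A minimal sketch of the tweaked recycling rule, assuming segments are identified by a flat, increasing integer and that the primary knows which segment each connected standby is currently requesting. The function names are hypothetical; this is not how the checkpoint code is actually structured.

```python
def oldest_segment_to_keep(checkpoint_redo_seg, standby_request_segs):
    """Segments numbered below the returned value are safe to recycle.

    Without standbys the limit is the segment the checkpoint still needs.
    Each connected standby can only pull the limit further back.
    """
    return min([checkpoint_redo_seg, *standby_request_segs])


def recyclable(segments_on_disk, checkpoint_redo_seg, standby_request_segs):
    """Return the segments a checkpoint would be allowed to recycle."""
    keep_from = oldest_segment_to_keep(checkpoint_redo_seg, standby_request_segs)
    return [seg for seg in sorted(segments_on_disk) if seg < keep_from]


if __name__ == "__main__":
    on_disk = range(10, 20)
    # No standby: everything before the checkpoint's segment can go.
    print(recyclable(on_disk, 17, []))
    # A standby is still fetching segment 14: segments 14..16 must stay too.
    print(recyclable(on_disk, 17, [14]))
```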

>> 3. Need to support multiple WALSenders. While multiple slave support
>> isn't 1st priority right now, it's not acceptable that a new WALSender
>> can't connect while one is active already. That can cause trouble in
>> case of network problems etc.
>
> Sorry, I didn't get your point. You think multiple slave support isn't 1st
> priority, so why should a multiple-walsender mechanism be necessary?
> Can you describe the problem cases in more detail?

As the patch stands, new walsender connections are refused when one is
active already. What if the walsender connection is in a zombie state?
For example, it's trying to send WAL to the slave, but the network
connection is down, and the packets are going to a black hole. It will
take a while for the TCP layer to declare the connection dead, and close
the socket. During that time, you can't connect a new slave to the
master, or the same slave using a better network connection.

The most robust way to fix that is to support multiple walsenders. The
zombie walsender can take its time to die, while the new walsender
serves the new connection. You could tweak SO_TIMEOUTs and stuff, but
even then the standby process could be in some weird hung state.

And of course, when we get around to adding support for multiple slaves,
we'll have to do that anyway. Better to get it right to begin with.
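The zombie scenario can be modelled with a small slot table. This is only a toy sketch: the class, the slot limit, and the standby names are made up for illustration and do not correspond to anything in the patch.

```python
class WalSenderSlots:
    """Toy model of a fixed-size table of walsender slots (hypothetical)."""

    def __init__(self, max_senders):
        self.slots = [None] * max_senders

    def connect(self, standby_name):
        """Take the first free slot. Return its index, or None if refused."""
        for index, owner in enumerate(self.slots):
            if owner is None:
                self.slots[index] = standby_name
                return index
        return None

    def disconnect(self, index):
        """Free a slot, e.g. once TCP finally declares the connection dead."""
        self.slots[index] = None


if __name__ == "__main__":
    # One slot: a zombie walsender blocks every new connection.
    single = WalSenderSlots(max_senders=1)
    single.connect("standby-over-dead-link")
    print(single.connect("standby-over-good-link"))  # refused

    # Two slots: the new connection is served while the zombie lingers.
    multi = WalSenderSlots(max_senders=2)
    zombie = multi.connect("standby-over-dead-link")
    print(multi.connect("standby-over-good-link"))
    multi.disconnect(zombie)  # the zombie dies whenever the socket times out
```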

>> 4. It is not acceptable that normal backends have to wait for walsender
>> to send data.
>
> Umm... this is true in asynchronous replication case. Also true while the
> standby is catching up with the primary. After those servers get into
> synchronization, the backend should wait for walsender to send data (and
> also walreceiver to write/fsync data) before returning "success" of COMMIT
> to the client. Is my understanding right?

Even in synchronous replication, a backend should only have to wait when
it commits. You would only see the difference with very large
transactions that write more WAL than fits in wal_buffers, though, like
data loading.

> In current Synch Rep, the backend basically doesn't wait for walsender in
> asynchronous mode. But only when wal_buffers is filled with unsent data,
> the backend waits for walsender to send data because there is no room to
> insert new data. You suggest only that this problem case should be solved?

Right, that is the problem.
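To make the problem case concrete, here is a toy model of wal_buffers with two reclaim policies. Under "sent", buffer space is reused only after the walsender has sent the data, which is the behaviour described above. Under "written", space is reused as soon as the data has been written locally. The second policy is just one hypothetical way out, and it assumes a lagging walsender could fetch older WAL from somewhere else (for example the WAL files on disk); the discussion here does not prescribe that mechanism. All names are made up.

```python
class WalBufferModel:
    """Byte-counter model of WAL buffers (illustrative only)."""

    def __init__(self, capacity, reclaim_on):
        if reclaim_on not in ("sent", "written"):
            raise ValueError("reclaim_on must be 'sent' or 'written'")
        self.capacity = capacity
        self.reclaim_on = reclaim_on
        self.inserted = 0  # total bytes inserted by backends
        self.written = 0   # total bytes written to local disk
        self.sent = 0      # total bytes sent by the walsender

    def free_space(self):
        if self.reclaim_on == "sent":
            # Space is reusable only once written locally AND sent.
            reclaimed = min(self.written, self.sent)
        else:
            reclaimed = self.written
        return self.capacity - (self.inserted - reclaimed)

    def try_insert(self, nbytes):
        """Return True if the insert proceeds, False if the backend must wait."""
        if nbytes > self.free_space():
            return False
        self.inserted += nbytes
        return True

    def local_write(self):
        """Write everything inserted so far to local disk."""
        self.written = self.inserted


if __name__ == "__main__":
    for policy in ("sent", "written"):
        buf = WalBufferModel(capacity=100, reclaim_on=policy)
        buf.try_insert(80)
        buf.local_write()            # the slow standby has received nothing yet
        print(policy, buf.try_insert(80))
```

In the "written" case the backend keeps going and the walsender simply falls further behind, which is what one would expect from asynchronous mode.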

>> That means that connecting a standby behind a slow
>> connection to the primary can grind the primary to a halt.
>
> This is the fate of *synchronous* replication, isn't it? If a user wants to get
> around such problem, asynchronous mode should be chosen, I think.

Right. But as the patch stands, asynchronous mode has the same problem,
which is not acceptable.

> Sounds good. I'll advance development in stages as you suggested.

Thanks!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
