Re: Synch Rep for CommitFest 2009-07

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 08:28:42
Message-ID: 3f0b79eb0907160128h53d0c5feh7c4e4815c8471e67@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Jul 16, 2009 at 6:03 AM, Heikki
Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I don't think there's much point assigning more reviewers to Synch Rep
> at this point. I believe we have consensus on four major changes:

Thanks for clarifying the issues! OK, I'll rework the patch.

> 1. Change the way synchronization is done when standby connects to
> primary. After authentication, standby should send a message to primary,
> stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
> segment name). Primary starts streaming WAL starting from that point,
> and keeps streaming forever. pg_read_xlogfile() needs to be removed.

I assume that <begin> should indicate the location of the last valid record.
In other words, the standby first tries to recover using only the XLOG
files that exist in its archive or pg_xlog. When it has reached the last valid
record, it requests the XLOG records that follow <begin> from the primary.
Is my understanding correct?

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
As I described before, the XLOG file which the standby creates should be
recoverable. So, when <begin> points to the middle of an XLOG file, the
primary should start sending records from the head of the file that
contains <begin>. Is this OK?

Or should the primary start from <begin>? In this case, since we can
expect that the incomplete file containing <begin> also exists on the
standby, the records following <begin> need to be appended to it.
And if that incomplete file was restored from the archive, it would need
to be renamed from its temporary name before being appended to.
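
To make the two options concrete, here is a rough sketch of the arithmetic
involved. It is not code from the patch. It assumes an XLOG position can be
modelled as a single byte offset and that segments have the default 16 MB
size; the function names are made up for illustration.

```python
# Illustrative only: an XLOG position is modelled as one integer byte offset.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default XLOG segment size (16 MB)


def segment_start(begin: int) -> int:
    """First option: position of the head of the file that contains <begin>."""
    return begin - (begin % WAL_SEGMENT_SIZE)


def offset_in_segment(begin: int) -> int:
    """Second option: byte offset inside the standby's partial file at which
    the records following <begin> would be appended."""
    return begin % WAL_SEGMENT_SIZE


# Example: <begin> lies 4096 bytes into the fourth segment (segment number 3).
begin = 3 * WAL_SEGMENT_SIZE + 4096
stream_from_head = segment_start(begin)    # primary resends the whole file
append_at_offset = offset_in_segment(begin)  # standby appends at this offset
```

With the first option the standby can overwrite its partial file from the
start; with the second it has to keep the existing bytes before the offset
intact.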

Timeline and backup history files are also required for recovery, but they
are not found on the standby. So they need to be shipped from the primary,
and this capability is provided by pg_read_xlogfile(). If we remove the
function, how should we transfer those history files? Is a function similar
to pg_read_xlogfile(), which takes the filename as an argument, still
necessary?

> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

In this case, a backup history file should be kept in pg_xlog for a while,
because it might be requested by the standby. Currently, pg_start_backup()
removes the previous backup history file right away. Should we introduce
a new GUC parameter to determine how many backup history files are kept
in pg_xlog?

CHECKPOINT should not recycle the XLOG file that the standby is requesting
at that moment, or any file after it. So, we need to tweak the
recycling policy.
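
As a rough sketch of such a policy (again not code from the patch): a
checkpoint could treat the oldest segment any standby is currently requesting
as a lower bound, in addition to its own redo segment. Segment numbers are
modelled as plain integers and all names here are hypothetical.

```python
def oldest_segment_to_keep(redo_segno, standby_segnos):
    """Oldest segment that must survive the checkpoint: the checkpoint's own
    redo segment or the oldest segment a standby is requesting, whichever
    comes first."""
    return min([redo_segno, *standby_segnos])


def recyclable_segments(existing_segnos, redo_segno, standby_segnos):
    """Segments strictly older than the keep point may be recycled."""
    keep_from = oldest_segment_to_keep(redo_segno, standby_segnos)
    return [s for s in existing_segnos if s < keep_from]


# A standby still requesting segment 11 holds back segments 11 and later,
# even though the checkpoint alone would only need segment 13 onwards.
held_back = recyclable_segments([10, 11, 12, 13, 14], 13, [11])
no_standby = recyclable_segments([10, 11, 12, 13, 14], 13, [])
```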

> 3. Need to support multiple WALSenders. While multiple slave support
> isn't 1st priority right now, it's not acceptable that a new WALSender
> can't connect while one is active already. That can cause trouble in
> case of network problems etc.

Sorry, I didn't get your point. If multiple slave support isn't the first
priority, why is a multiple-walsender mechanism necessary?
Can you describe the problem cases in more detail?

> 4. It is not acceptable that normal backends have to wait for walsender
> to send data.

Umm... this is true in the asynchronous replication case. It is also true
while the standby is catching up with the primary. But once those servers are
in sync, the backend should wait for walsender to send data (and
also for walreceiver to write/fsync it) before returning "success" for COMMIT
to the client. Is my understanding right?

In the current Synch Rep patch, the backend basically doesn't wait for
walsender in asynchronous mode. Only when wal_buffers is filled with unsent
data does the backend wait for walsender to send data, because there is no
room to insert new data. Are you suggesting that only this case should be solved?
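
The wal_buffers case can be pictured with a toy model like the one below. It
is only a sketch of the behaviour described above, not the patch's data
structure: the class, its fields, and the byte counters are assumptions made
for illustration.

```python
class WalBufferModel:
    """Toy model: buffer space is freed only once walsender has sent the data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inserted = 0  # total bytes inserted by backends so far
        self.sent = 0      # total bytes walsender has sent so far

    def free_space(self):
        # Unsent data still occupies the buffer.
        return self.capacity - (self.inserted - self.sent)

    def try_insert(self, nbytes):
        """Return False when the backend would have to wait for walsender."""
        if nbytes > self.free_space():
            return False
        self.inserted += nbytes
        return True

    def walsender_send(self, nbytes):
        """Send up to nbytes of unsent data and return how much was sent."""
        n = min(nbytes, self.inserted - self.sent)
        self.sent += n
        return n


buf = WalBufferModel(capacity=100)
first = buf.try_insert(80)      # fits into the empty buffer
blocked = buf.try_insert(30)    # no room: only 20 bytes are free
sent = buf.walsender_send(50)   # walsender frees 50 bytes
retry = buf.try_insert(30)      # now there is room
```

If walsender could read from disk instead, the unsent data would not need to
stay in the buffer, and the second insert would not have to wait on walsender.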

> That means that connecting a standby behind a slow
> connection to the primary can grind the primary to a halt.

This is the fate of *synchronous* replication, isn't it? If a user wants to
get around such a problem, asynchronous mode should be chosen, I think.

> walsender
> needs to be able to read data from disk, not just from shared memory. (I
> raised this back in December
> http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

OK, I'll try it.

> As a hint, I think you'll find it a lot easier if you implement only
> asynchronous replication at first. That reduces the amount of
> inter-process communication a lot. You can then add synchronous
> capability in a later commitfest. I would also suggest that for point 4,
> you implement WAL sender so that it *only* reads from disk at first, and
> only add the capability send from wal_buffers later on, and only if
> performance testing shows that it's needed.

Sounds good. I'll advance development in stages as you suggested.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
