Re: Proposal for 9.1: WAL streaming from WAL buffers

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal for 9.1: WAL streaming from WAL buffers
Date: 2010-06-30 09:36:48
Message-ID: AANLkTimIr1XIq4jNpSB32vtZc4t7qpD18ofvaDOWtVg3@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 30, 2010 at 11:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Maybe.  As Heikki pointed out upthread, the standby can't even write
> the WAL back to the OS until it's been fsync'd on the master
> without risking the problem under discussion.

If we change the startup process so that it never replays WAL beyond the
master's fsync location, even after the walreceiver is terminated,
we would not need to worry about that risk. For further robustness,
the walreceiver could, when it is terminated, zero out any WAL records
that have not been fsync'd on the master yet.
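
A minimal sketch of the rule above, written in Python rather than the
server's C, with WAL positions simplified to plain integer offsets. The
names replay_limit, zero_unsynced_tail, received_lsn and master_fsync_lsn
are hypothetical and are not taken from the PostgreSQL source:

```python
def replay_limit(received_lsn: int, master_fsync_lsn: int) -> int:
    """Highest WAL position the startup process may replay.

    Replay never goes past what the master has fsync'd, even if the
    standby has already received more WAL than that.
    """
    return min(received_lsn, master_fsync_lsn)


def zero_unsynced_tail(wal: bytearray, master_fsync_lsn: int) -> None:
    """Zero, in place, every byte past the master's fsync location.

    Models what the walreceiver might do when it is terminated.
    """
    for i in range(master_fsync_lsn, len(wal)):
        wal[i] = 0
```

This only models the bookkeeping; it says nothing about how the standby
would learn or persist the master's fsync location.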

However, if the standby crashes after the master has crashed, the
restarted standby might wrongly replay that non-fsync'd WAL, because it
no longer remembers the master's fsync location. In this case, if we
promote the standby to master, we still don't have to worry about that
risk. But if, instead of performing a failover, we restart the master and
have the standby connect to it again, the database on the standby
would get corrupted.

For now, I don't have a good idea for avoiding the database corruption
caused by this double failure (a crash of both master and standby)...

> So we can stream the
> WAL from master to standby as long as the standby just buffers it in
> memory (or somewhere other than the usual location in pg_xlog).

Yeah, I was just thinking the same thing. But the problem is that the
buffer might become too big (possibly bigger than 16MB). For
example, synchronous_commit = off and wal_writer_delay = 10000ms on
the master would delay the fsync significantly and increase the buffer
size on the standby.
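
A back-of-the-envelope sketch of that concern, assuming the standby must
hold everything generated during one fsync delay on the master. The WAL
generation rate of 5MB/s is a made-up workload, and the function name is
hypothetical:

```python
LIMIT_BYTES = 16 * 1024 * 1024  # the 16MB figure mentioned above


def standby_buffer_bytes(wal_rate_bytes_per_sec: int, fsync_delay_sec: int) -> int:
    """WAL that can accumulate on the standby before the master fsyncs."""
    return wal_rate_bytes_per_sec * fsync_delay_sec


# Assumed workload: 5MB of WAL per second, wal_writer_delay = 10000ms.
needed = standby_buffer_bytes(5 * 1024 * 1024, 10)
exceeds_limit = needed > LIMIT_BYTES
```

Under these assumed numbers the standby would need about 50MB, well
above 16MB.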

> Before we get too busy frobnicating this gonkulator, I'd like to see a
> little more discussion of what kind of performance people are
> expecting from sync rep.  Sounds to me like the best we can expect
> here is, on every commit: (a) wait for master fsync to complete, (b)
> send message to standby, (c) wait for reply from standby
> indicating that fsync is complete on standby.  Even assuming that the
> network overhead is minimal, that halves the commit rate.  Are the
> people who want sync rep OK with that?  Is there any way to do better?

(c) would depend on the synchronization mode the user chooses:

#1 Wait for WAL to be received by the standby
#2 Wait for WAL to be received and flushed by the standby
#3 Wait for WAL to be received, flushed and replayed by the standby

(a) would depend on synchronous_commit. Personally, I'm interested in
disabling synchronous_commit on the master and choosing #1 as the sync
mode, though this may be a very optimistic configuration :)

The key to sync rep performance, I think, is to parallelize (a) and
(b)+(c). If they are performed serially, the performance
overhead on the master would become high.
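
To illustrate the difference, here is a rough latency model in Python. It
is only a sketch: the SyncMode names, the function names, and all timings
are hypothetical, and it assumes each mode simply adds one more step on
the standby to the time spent in (b)+(c):

```python
from enum import Enum


class SyncMode(Enum):
    RECV = 1    # #1: WAL received by the standby
    FLUSH = 2   # #2: WAL received and flushed
    REPLAY = 3  # #3: WAL received, flushed and replayed


def standby_wait_ms(mode: SyncMode, recv_ms: int, flush_ms: int, replay_ms: int) -> int:
    """Time spent in (b)+(c) for the chosen synchronization mode."""
    steps = [recv_ms, flush_ms, replay_ms]
    return sum(steps[:mode.value])


def serial_commit_ms(master_fsync_ms: int, standby_ms: int) -> int:
    """(a) finishes before (b)+(c) starts, so the costs add up."""
    return master_fsync_ms + standby_ms


def parallel_commit_ms(master_fsync_ms: int, standby_ms: int) -> int:
    """(a) and (b)+(c) overlap, so the slower of the two dominates."""
    return max(master_fsync_ms, standby_ms)
```

For example, with an assumed 10ms master fsync and mode #2 costing 1ms
to receive plus 10ms to flush, the serial path takes 21ms per commit
while the parallel path takes 11ms.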

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
