Re: Streaming replication, some small issues

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, some small issues
Date: 2009-12-08 11:38:31
Message-ID: 3f0b79eb0912080338m71505de4g1aa61e6229fc1666@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 8, 2009 at 5:30 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> A couple of small issues spotted while reviewing the streaming
> replication patch:

Thanks for the review!

> - Because sentPtr is initialized to zeros, GetOldestWALSendPointer will
> return zero before a just-launched WAL sender has sent its first
> message. That can lead to WAL files that are still needed by another
> standby to be deleted prematurely.

Oops! I fixed that (in my git repository, see the bottom of this mail).
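
To make the problem concrete, here is a minimal sketch of one way to
compute the oldest sent position without mistaking a just-launched
sender for "nothing to keep". It is not the code in the repository:
WalPos, SenderSlot, OldestStatus and oldest_sent_pos are hypothetical
names, and the choice to report "unknown" (so that the caller keeps
all WAL files) is an assumption about how the caller would use it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a WAL position (hypothetical type). */
typedef struct WalPos
{
    uint32_t logid;   /* log file number */
    uint32_t recoff;  /* byte offset within that log file */
} WalPos;

/* Hypothetical per-walsender slot in shared memory. */
typedef struct SenderSlot
{
    bool   in_use;    /* slot belongs to a running WAL sender */
    WalPos sent;      /* stays 0/0 until the first message is sent */
} SenderSlot;

typedef enum OldestStatus
{
    OLDEST_NONE,      /* no active senders: nothing is held back */
    OLDEST_UNKNOWN,   /* some sender has not sent yet: keep everything */
    OLDEST_FOUND      /* *oldest holds the minimum sent position */
} OldestStatus;

static bool
pos_is_unset(WalPos p)
{
    return p.logid == 0 && p.recoff == 0;
}

static bool
pos_less(WalPos a, WalPos b)
{
    return a.logid < b.logid ||
           (a.logid == b.logid && a.recoff < b.recoff);
}

/* Scan all slots; *oldest is written only when OLDEST_FOUND is returned. */
static OldestStatus
oldest_sent_pos(const SenderSlot *slots, size_t nslots, WalPos *oldest)
{
    bool   found = false;
    WalPos min = {0, 0};

    for (size_t i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use)
            continue;
        /* A zeroed pointer means "not sent yet", not "needs nothing". */
        if (pos_is_unset(slots[i].sent))
            return OLDEST_UNKNOWN;
        if (!found || pos_less(slots[i].sent, min))
        {
            min = slots[i].sent;
            found = true;
        }
    }

    if (!found)
        return OLDEST_NONE;
    *oldest = min;
    return OLDEST_FOUND;
}
```

In this sketch a caller would remove old WAL files only on OLDEST_NONE
or, up to the returned position, on OLDEST_FOUND.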

> - If a WAL file is not found in the master for some reason, standby goes
> into an infinite loop retrying it:
>
> ERROR:  could not read xlog records: FATAL:  could not open file
> "pg_xlog/000000010000000000000000" (log file 0, segment 0): No such file
> or directory

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01393.php
>> walreceiver shouldn't die on connection error, just to be restarted by
>> startup process. Can we add error handling a la bgwriter and have a
>> retry loop within walreceiver?

Taking your current and previous comments together, do you mean
that walreceiver should always retry connecting to the primary
after a connection error occurs in PQgetXLogData/PQputXLogRecPtr,
and exit after any other error? I'm not sure whether we can
determine the error type precisely, though.
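
The behaviour asked about above could be sketched roughly as follows.
This is only an illustration under stated assumptions: it presumes
that each receive step can be classified reliably, which is exactly
the open question, and StepResult, ScriptedSource, next_step,
reconnect and receive_loop are hypothetical names, not functions from
the patch or from libpq. The "source" is a scripted list of outcomes
so that the loop can be exercised without a real connection.

```c
#include <stddef.h>

/* Outcome of one receive step (hypothetical classification). */
typedef enum StepResult
{
    STEP_DATA,         /* received some WAL data; keep going */
    STEP_END,          /* stream ended cleanly */
    STEP_CONN_ERROR,   /* connection problem: reconnect and retry */
    STEP_OTHER_ERROR   /* anything else, e.g. a missing WAL file: exit */
} StepResult;

typedef enum ExitReason
{
    EXIT_STREAM_END,
    EXIT_ERROR
} ExitReason;

/* Scripted stand-in for the connection to the primary. */
typedef struct ScriptedSource
{
    const StepResult *script;  /* outcomes to replay, in order */
    size_t len;                /* number of entries in script */
    size_t next;               /* index of the next entry to replay */
    int    reconnects;         /* how many times reconnect() ran */
} ScriptedSource;

static StepResult
next_step(ScriptedSource *src)
{
    if (src->next >= src->len)
        return STEP_END;
    return src->script[src->next++];
}

static void
reconnect(ScriptedSource *src)
{
    /* A real implementation would re-establish the connection here. */
    src->reconnects++;
}

/* Retry inside the loop on connection errors; exit on everything else. */
static ExitReason
receive_loop(ScriptedSource *src)
{
    for (;;)
    {
        switch (next_step(src))
        {
            case STEP_DATA:
                break;                  /* continue with the next step */
            case STEP_CONN_ERROR:
                reconnect(src);
                break;                  /* retry without exiting */
            case STEP_END:
                return EXIT_STREAM_END;
            case STEP_OTHER_ERROR:
                return EXIT_ERROR;      /* do not loop forever */
        }
    }
}
```

With this split, the "could not open file" case quoted above would
count as an "other" error and end the loop instead of being retried
indefinitely.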

> - It's possible to shut down master, change max_wal_senders to 0,
> restart and do an operation like CLUSTER which then skips WAL-logging.
> Then shutdown, change max_wal_senders back to non-zero. All this while
> the standby is running. Leads to a corrupt standby.

I've regarded this case as a restriction. How do you think we
should cope with it?

1. Restriction: is documenting it enough?
2. Safeguard:
- forbid the primary from performing such operations while the
standby is running?
- emit a PANIC error on the standby if the primary that lost sync
restarts?
3. Full solution: is an automatic resync mechanism required?

> I've also pushed a couple of small cosmetic changes to replication
> branch at git://git.postgresql.org/git/users/heikki/postgres.git

Your changes seem good.

I pulled and merged your changes into my repository:

git://git.postgresql.org/git/users/fujii/postgres.git
branch: replication

I also pushed support for replicating a backup history file to the
repository.

> I'll continue reviewing...

Thanks a lot!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
