Streaming Replication: Observations, Questions and Comments

From: Samba <saasira(at)gmail(dot)com>
To: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Streaming Replication: Observations, Questions and Comments
Date: 2011-08-24 15:33:17
Message-ID: CAKgWO9L=hOqTzuBVP_WjzbbPTPty8BK13nNaFkiNPUH1C9zuTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

We have a postgres-9.0 streaming replication set up where we keep the WAL
segments on the master amounting to 10 GB so that we can survive longer
periods of disconnect between master and slave. We do not use any shared
storage space for archiving WAL logs. (the shared disk server may turn out
to be another point of failure, which we would want to avoid)

Here is our basic configuration parameters in :

postgresql.conf on master:
wal_keep_segments = 640 # previously 32 | # in logfile segments, min 1,
16MB each
archive_mode = on # allows archiving to be done # (change
requires restart)
archive_command = 'cp -v %p /archives/data/pgsql/pg_xlog/%f'

postgresql.conf on slave:
wal_level = hot_standby
hot_standby = on

recovery.conf on slave:
standby_mode = 'on'
primary_conninfo = 'host=ip.add.ress port=5432 user=repman'
trigger_file = '/var/lib/pgsql/data/stop.replication'
restore_command = 'cp -i /archives/data/pgsql/pg_xlog/%f "%p"'

Master and Slave servers are sperated by thousands of miles and the network
bandwidth comprises just an ordinary 1 Mbps DSL line. Both the master server
and the slave server have the /archives partition mounted and synced with
csync2 between master and the slave systems. I'm not sure if this is the
correct way of configuring streaming replication, but I will explain what
worked for us and what we are still left wanting with:

Under heavy inserts/updates/deletes on the master (load generated by stored
procedures), we noticed that the slave went far behind the master and
resulted into breakage of replication. Hence we changed from 32 log file
segments to 640, which corresponds to 10 GB so that we can survive either
very heavy spikes of load or even a week's disconnect of the slave (although
alarms would be raised appropriately for the same effect).

One strange thing I noticed is that the pg_xlogs on the master have outsized
the actual data stored in the database by at least 3-4 times, which was
quite surprising. I'm not sure if 'restore_command' has anything to do with
it. I did not understand why transaction logs would need to be so many times
larger than the actual size of the database, have I done something wrong
somewhere?

Another interesting fact we noticed is that once the replication is broken
for some longer time ( walsender and walreceiver processes have died by this
time), we had to restart not only the slave server but also the master
server, which was quite strange. Shouldn't the master server start (if there
is none running) walsender process the moment it receives a request for
streaming? Similarly, why should the slave be restarted just to start
replication again? why can't these two processes be independently started
and stopped by the postmaster process as and when necessary as per the need
to replicate or not?

Another thing that I noticed was that the slave server has logged that it is
out of sync and hence closing replication but the master did not say
anything about this breakage of replication.

So summing up the above, I would like to have some pointers to understand
the following, which I think will benefit many others as well:

- Do I need to 'archive' since I'm storing quite a significant number of
logfile segments that can help sustain disconnect for almost a week?
- Why did the xlog data accumulate so much as to be 3-4 times that of the
actual size of the database growth during the same period?
- why should we restart the postgres service on slave and/or master if
we need to join a slave back into replication after a long disconnect?
- why is the master not complaining about the loss of replication? (of
course, slave did complain about the disconnect or its inability to continue
accepting data from master on account difference in xlog location).

Some of the above might be because of wrong configuration, while some may be
give hints for future enhancements. I hope this will start a healthy
discussion on the areas where streaming replication needs to be improved and
strengthened.

Thanks and Regards,
Samba

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2011-08-24 15:51:03 Re: init script or procedure
Previous Message Martín Marqués 2011-08-24 15:12:00 Re: question regarding full_page_write