Re: replication timeout in pg_basebackup

From: "Aggarwal, Ajay" <aaggarwal(at)verizon(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: replication timeout in pg_basebackup
Date: 2014-03-10 20:07:45
Message-ID: 3B7431C850F4F347885C4CE5DD7B401993A93A07@MIA20725MBX891A.apps.tmrk.corp
Lists: pgsql-general

Thanks Hari Babu.

I think what is happening is that the dirty page cache builds up quickly for the volume I am backing up to. That triggers a flush of those dirty pages to disk, and while the flush is in progress pg_basebackup calls fsync() on a received WAL file and blocks.
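For anyone who wants to watch this happen, here is a minimal sketch (Python, Linux only) that reads the Dirty and Writeback counters from /proc/meminfo. The function names are my own and purely illustrative; the assumption is that sampling these two counters during a backup is enough to see the dirty page count climb before fsync() stalls.

```python
def parse_meminfo(text):
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo-style text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        # Match the field name exactly so that e.g. WritebackTmp is ignored.
        if sep and name in wanted:
            result[name] = int(rest.split()[0])
    return result


def read_dirty_kb(path="/proc/meminfo"):
    """Read the current counters from the running kernel (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__":
    print(read_dirty_kb())
```

Running it in a loop (say once a second) while pg_basebackup is active should show whether the slow fsync() calls line up with a large Dirty value.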

While in this state, i.e. when the dirty page count is high, pg_test_fsync gives the following results:

# /usr/pgsql-9.2/bin/pg_test_fsync -f /backup/fsync_test
2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 16.854 ops/sec
fdatasync 15.242 ops/sec
fsync 0.187 ops/sec
fsync_writethrough n/a
open_sync 14.747 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 6.137 ops/sec
fdatasync 14.899 ops/sec
fsync 0.007 ops/sec
fsync_writethrough n/a
open_sync 1.450 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 13.486 ops/sec
2 * 8kB open_sync writes 6.006 ops/sec
4 * 4kB open_sync writes 3.446 ops/sec
8 * 2kB open_sync writes 1.400 ops/sec
16 * 1kB open_sync writes 0.859 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.009 ops/sec
write, close, fsync 0.008 ops/sec

Non-Sync'ed 8kB writes:
write 99415.368 ops/sec

However, when no backup is running and the dirty page count is low, the same test gives these results:

# /usr/pgsql-9.2/bin/pg_test_fsync -f /backup/fsync_test
2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 1974.243 ops/sec
fdatasync 1410.804 ops/sec
fsync 181.129 ops/sec
fsync_writethrough n/a
open_sync 547.389 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 290.109 ops/sec
fdatasync 962.378 ops/sec
fsync 158.987 ops/sec
fsync_writethrough n/a
open_sync 642.309 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 1014.456 ops/sec
2 * 8kB open_sync writes 627.964 ops/sec
4 * 4kB open_sync writes 340.313 ops/sec
8 * 2kB open_sync writes 173.581 ops/sec
16 * 1kB open_sync writes 103.236 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 244.670 ops/sec
write, close, fsync 207.248 ops/sec

Non-Sync'ed 8kB writes:
write 202216.900 ops/sec
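To make the two runs easier to compare, here is a small sketch that converts the fsync rates above into seconds per operation and checks them against the 60-second replication timeout mentioned in the original report. The dictionary and function names are illustrative only; the figures are copied from the two pg_test_fsync runs above.

```python
# Default replication timeout from the original report, in seconds.
REPLICATION_TIMEOUT_S = 60.0

# ops/sec for the plain fsync rows, copied from the two runs above.
UNDER_LOAD = {
    "fsync, one 8kB write": 0.187,
    "fsync, two 8kB writes": 0.007,
    "write, fsync, close": 0.009,
}
IDLE = {
    "fsync, one 8kB write": 181.129,
    "fsync, two 8kB writes": 158.987,
    "write, fsync, close": 244.670,
}


def seconds_per_op(ops_per_sec):
    """Convert a pg_test_fsync rate into the average time one operation takes."""
    return 1.0 / ops_per_sec


if __name__ == "__main__":
    for name, rate in UNDER_LOAD.items():
        busy = seconds_per_op(rate)
        idle = seconds_per_op(IDLE[name])
        flag = "exceeds timeout" if busy > REPLICATION_TIMEOUT_S else "within timeout"
        print(f"{name}: {busy:.1f} s under load vs {idle * 1000:.1f} ms idle ({flag})")
```

At 0.007 ops/sec a single fsync takes roughly 143 seconds on average, which is well past a 60-second timeout.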

________________________________
From: Haribabu Kommi [kommi(dot)haribabu(at)gmail(dot)com]
Sent: Monday, March 10, 2014 1:42 AM
To: Aggarwal, Ajay
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] replication timeout in pg_basebackup

On Mon, Mar 10, 2014 at 12:52 PM, Aggarwal, Ajay <aaggarwal(at)verizon(dot)com> wrote:
Our environment: Postgres version 9.2.2 running on CentOS 6.4

Our backups using pg_basebackup are frequently failing with the following error:

"pg_basebackup: could not send feedback packet: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request."

We are invoking pg_basebackup with these arguments: pg_basebackup -D backup_dir -X stream -l backup_dir

In the postgres logs we see this message: "terminating walsender process due to replication timeout".

Our replication timeout is the default 60 seconds. If we increase the replication timeout to, say, 180 seconds, we see better results, but backups still fail occasionally.

Running strace on pg_basebackup process, we see that the fsync() call takes significant time and could be responsible for causing this timeout in postgres.
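As a rough cross-check of the strace observation, a sketch like the following times a single fsync() on a scratch file in the backup volume. The function name and the 8 kB write size are assumptions chosen for illustration; this is not how pg_basebackup itself is instrumented.

```python
import os
import sys
import tempfile
import time


def time_fsync(directory, size=8192):
    """Write `size` bytes to a scratch file in `directory` and time one os.fsync() call."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * size)
        start = time.monotonic()
        os.fsync(fd)  # blocks until the kernel reports the data as flushed
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Pass the directory to test as the first argument; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"fsync took {time_fsync(target):.3f} s")
```

Pointing it at the volume that receives the backup, once while a backup is running and once while idle, should show the same kind of gap that strace reveals.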

Use the pg_test_fsync utility, which is available in the PostgreSQL contrib modules, to test the performance of your system's sync methods.

Has anybody else run into the same issue? Is there a way to run pg_basebackup without fsync()?

As of now there is no such option available. I feel it is better to find out why the sync is taking so long.

Regards,
Hari Babu
Fujitsu Australia
