Wide area replication postgres 9.1.6 slon 2.1.2 large table failure.

From: Tory M Blue <tmblue(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Wide area replication postgres 9.1.6 slon 2.1.2 large table failure.
Date: 2013-01-12 05:49:21
Message-ID: CAEaSS0akPm4zBm5-Ot9kHvCH=xQwx9QFNUwj8nuzbicZhx+X0Q@mail.gmail.com
Lists: pgsql-hackers

So I started this thread on the slon forum, and they mentioned that I/we
should ask here.

Postgres 9.1.4 slon 2.1.1
-and-
Postgres 9.1.6 slon 2.1.2

Scenario:

Node 1 is on a gig circuit and is the master (West Coast).

Node 2 is also on a gig circuit and is the slave (Georgia).

Symptoms: slon dies immediately after transferring the biggest table in the
set. This happens with 2 of the 3 sets; the set that actually completes has
no large tables.

Set 1 has a table that takes just under 6000 seconds to copy, and set 2 has
a table that takes double that. In both cases the copy of the large table
itself completes, and slon fails right after it.

1224459-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: 5760.913
seconds to copy table "cls"."listings"
1224560-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: copy table
"cls"."customers"
1224642-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: Begin COPY of
table "cls"."customers"
1224733-2013-01-11 14:21:10 PST ERROR remoteWorkerThread_1: "select
"_admissioncls".copyFields(8);" <--- this has the proper data
1224827:2013-01-11 14:21:10 PST WARN remoteWorkerThread_1: data copy for
set 1 failed 1 times - sleep 15 seconds
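
For anyone who wants to pull the same information out of a long slon log, here
is a minimal sketch in Python. The regular expressions and the function name
summarize() are assumptions based only on the log lines quoted above; they are
not part of slon, and the sample lines are the excerpt above with the mail
wrapping and the grep offsets removed.

```python
import re

# Assumed pattern for the line slon writes when a table copy finishes,
# e.g. '... 5760.913 seconds to copy table "cls"."listings"'.
COPY_TIME = re.compile(r'(\d+(?:\.\d+)?) seconds to copy table (\S+)')
# Severity keywords seen in the excerpt (FATAL is included as a guess).
LEVEL = re.compile(r'\b(ERROR|WARN|FATAL)\b')

def summarize(lines):
    """Return (copies, problems): finished table copies as (table, seconds)
    pairs, and the raw ERROR/WARN/FATAL lines."""
    copies, problems = [], []
    for line in lines:
        match = COPY_TIME.search(line)
        if match:
            copies.append((match.group(2), float(match.group(1))))
        elif LEVEL.search(line):
            problems.append(line.strip())
    return copies, problems

# The log excerpt from this message, one entry per log line.
sample = [
    '2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: 5760.913 seconds to copy table "cls"."listings"',
    '2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: copy table "cls"."customers"',
    '2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: Begin COPY of table "cls"."customers"',
    '2013-01-11 14:21:10 PST ERROR remoteWorkerThread_1: "select "_admissioncls".copyFields(8);"',
    '2013-01-11 14:21:10 PST WARN remoteWorkerThread_1: data copy for set 1 failed 1 times - sleep 15 seconds',
]

copies, problems = summarize(sample)
for table, seconds in copies:
    print(table, seconds)
for line in problems:
    print(line)
```

On a real log, passing an open file object to summarize() instead of the
sample list should work the same way, since it only iterates over lines.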

Now in terms of postgres, if I do a plain COPY of the large table from node 1
to node 2, it completes (in under 2 hours) without issue.

From Node 2:
-bash-4.1$ psql -h idb02 -d admissionclsdb -c "copy cls.listings to stdout" | wc
4199441 600742784 6621887401

This worked fine.

I get no errors in the postgres logs, there is no network disconnect, and
since I can do a copy over the wire that completes, I'm at a loss. I don't
know what to look at, what to look for, or what to do. Obviously this is
the wrong place for slon issues.

One of the slon developers stated:
"I wonder if there's something here that should get bounced over to
pgsql-hackers or such; we're poking at a scenario here where the use
of COPY to stream data between systems is proving troublesome, and
perhaps there may be meaningful opinions over there on that."

If a plain COPY of the same table completes, yet that table is exactly where
the slon attempt fails, I'm just not sure what is going on.

Any suggestions are welcome, and please ask for more data. I can do anything
to the slave node; it's a bit tougher on the source, but I can arrange to
make changes to it if need be.

I just upgraded to 9.1.6 and slon 2.1.2, but prior tests were on 9.1.4 and
slon 2.1.1, and on a mix of postgres 9.1.4 slon 2.1.1 and postgres 9.1.6 slon
2.1.1 (node 2).

The other difference is that node 1 is running Fedora 12 and node 2 is
running CentOS 6.2.

Thanks in advance
Tory
