BUG #5585: SSL problems with long COPYs

From: "Karl Denninger" <karl(at)denninger(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5585: SSL problems with long COPYs
Date: 2010-08-01 06:08:16
Message-ID: 201008010608.o7168GX6042155@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 5585
Logged by: Karl Denninger
Email address: karl(at)denninger(dot)net
PostgreSQL version: 8.4.4
Operating system: FreeBSD 8.0
Description: SSL problems with long COPYs
Details:

This is a copy of a message I posted this evening on the SLONY list.

Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation. The table itself is some ~25gb
in size, and a dump and restore of the table onto a different machine (or
the same machine) can be completed without incident (e.g. the data is known
to not be corrupt.)

However, during a COPY performed by SLONY 2.0.4, the system says it lost the
connection. It in fact did not - the copy is running on the server,
although the output is closed (thus it is dumping into a closed file
descriptor)

Perusal of the server logs discloses that the SSL module is apparently
internally faulting, although the actual error message is less than useful
in trying to track down the cause.

Here is the copy of the email I just sent to the SLONY list.

Note that the copy, while not yet complete, is proceeding normally (and
about 20% in - a good 10x further than it ever got with SSL enabled) - with
SSL turned off. It is far enough along at this point that I am reasonably
confident that with SSL off it will complete as expected.

Karl Denninger wrote:
> > I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a
> > serious problem.
> >
> > The reason for the "gotta do it now" was that somehow one of the tables
> > got out of sync, and a delete was failing to propagate - hanging the
> > process.
> >
> > OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched. So I
> > upgraded to 2.0.4 on all the nodes, and told the subscriber to reload -
> > ditched the client config and re-subscribed the sets.
> >
> > All went well until a very large table came up - it failed.
> >
> > There's no error in the logs indicating why, other than the following:
> >
> > Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG
remoteWorkerThread_3:
> > copy table "public"."images"
> > Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG
remoteWorkerThread_3:
> > Begin COPY of table "public"."images"
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR
remoteWorkerThread_3:
> > PGgetCopyData() server closed the connection unexpectedly
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-2] This probably means the
> > server terminated abnormally
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-3] before or while
> > processing the request.
> > Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN
remoteWorkerThread_3:
> > data copy for set 1 failed 1 times - sleep 15 seconds
> >
> > And in 15 seconds, the entire process of trying to re-init the node
> > starts over - from the beginning!
> >
> > Near as I can tell, it's failing pretty early on.
> >
> > The source host is fine. This particular table contains a BYTEA field,
> > and it's BIG. ~20ish gigs big. But I've re-initialized in the past
> > without problems. I tried going back to 2.0.2, and that still fails.
> > Both servers are running with encoding set to SQL_ASCII, if it matters.
> >
> > When it fails the SERVER's COPY is still running - so the client is
> > definitely wrong on the reported error. I have NOTHING in the server's
> > SLON log and there are no comms problems between the two hosts.
> >
> > I'm going to run a dump of the table and see if I can manually bring it
> > over to the other host and load it. There's nothing going on with the
> > master that implicates the data being damaged......
> >
> > Ideas?
> >
> > -- Karl
> >
Oh boy, I think I know what's going on....

This looks like a problem in the SSL code (!!!!!)

I shut off SSL and I'm now ~4x further than where the copy has
previously failed. For obvious reasons this is decidedly un-good - a
perusal of the server's log (not slony's, but the server's) disclosed
several SSL errors in the log, which would account for the problem and
what the client thought were disconnects - but weren't.

Specifically, I got a bunch of these...

Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG: SSL error:
internal error
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG: SSL error:
internal error

This isn't SLONY's issue, but it's definitely a problem. I'll report it
over on the Postgres list in the morning...

In the meantime, I think I can get the copy to go with SSL off, and then
turn it back on once the copy is complete.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alex Hunsaker 2010-08-01 06:23:42 Re: BUG #5585: SSL problems with long COPYs
Previous Message Euler Taveira de Oliveira 2010-07-31 13:11:50 Re: help pleas