CopyManager bug + fix

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: CopyManager bug + fix
Date: 2010-05-08 07:24:44
Message-ID: i2pda0294261005080024t212944a0n41d9159c6336f712@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

This would have been a bug report with accompanying patch, but I
checked the latest code in CVS and found that the most recent commit
on CopyManager fixes the exact problem I was having. Now this email is
just search engine bait so that others who suffer the problem will
have an easier time finding the solution than I did (nothing useful
showed up in the google searches I did). I'll start with a
description of the problem:

I was seeing several different exceptions when copying data from a
Reader, but the most prevalent one was an exception stating "ERROR:
invalid byte sequence for encoding "UTF8": 0x00"

I spent hours trying to figure out where my code was sticking a NULL
character into the stream before finally giving up and looking to see
if it was a problem in the driver. As soon as I did, I found code in

public long copyIn(final String sql, Reader from, int bufferSize);

which was giving an entire buffer to the UTF-8 encoder even though
only a few bytes were populated. This would always show up for me
only on the last buffer of the file, no matter how large, because the
data that was being written into the PipedWriter that was feeding the
PipedReader was flowing in fast enough to always deliver a full buffer
until the final buffer of the 'file.' I would sometimes get errors
about missing columns, too, though those were less frequent - just
enough to confuse. If you don't want to grab the latest development
code for the driver, the bug is very easily fixed with a change to a
single line of code in the postgresql-jdbc-8.4-701.src tarball
available from the site.

in org/postgresql/copyCopyManager.java, modify line 142 from:

byte[] buf = encoding.encode(new String(cbuf));

to

byte[] buf = encoding.encode(new String(cbuf, 0, len));

which turns out to be identical to the most frequent commit to that file in CVS.

One other thing, however - It is worth mentioning somewhere in the
documentation (there appears to be no documentation of the Copy
feature in the jdbc driver docs that I could find) that using a Reader
instead of InputStream is pretty inefficient because of that encoding
step. It copies bytes from the Reader, instantiates a String (which
copies the bytes into the string), then extracts a byte buffer from
that string, which again copies the bytes. In my case, I had already
been very careful to ensure that the data I was sending was encoded
correctly, so that encoding step was just a bunch of totally redundant
data copies. Even digging into the code, this wouldn't have been
totally obvious if I hadn't gone in with a debugger and stepped all
the way into that code. It would be worth adding a comment so that
someone reading the code might at least realize the problem and move
to the InputStream version of the method instead. In my case, it was
relatively trivial to switch to Piped Input and Output Streams, since
I knew that my data was encoded correctly before I copied it in. An
alternative would be to allow the user of the api to specify whether
the encoding step should be performed or if the data should be
trusted. I may attempt such a patch when I get some spare cycles.

Thanks for the otherwise excellent codebase (which built so easily and
was very easy to read and comprehend). Had I known the code would be
so easy to dig into, I'd probably have found the problem much sooner,
but I tend to exhaust every other possible option, even when I suspect
a piece of 3rd party code, since it is usually so difficult to
comprehend, debug, build, and test 3rd party code.

--sam

Browse pgsql-jdbc by date

  From Date Subject
Next Message Gnanakumar 2010-05-10 14:19:04 Invalid message format Exception
Previous Message Kris Jurka 2010-05-07 16:00:49 Re: Bug: SimpleParameterList and resolved types