Populating large tables with occasional bad values

From: "John T(dot) Dow" <john(at)johntdow(dot)com>
To: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: Populating large tables with occasional bad values
Date: 2008-06-11 14:52:57
Message-ID: 200806111511.m5BFBlK7026168@web2.nidhog.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jdbc

Question: why are inserts approximately 10 times slower over the Internet than locally?

Background information:

I have a Java program that inserts rows using data derived from a legacy database which can have occasional bad values (eg text in a numeric column) as well as duplicate primary key values.

Postgres rejects the offending rows and my Java program writes the offending insert statement and the error message to a log file. This allows the user to take action, perhaps cleaning the original data and reloading, perhaps entering a few rows manually, whatever.

In the case of duplicate key values, for certain tables the program is told to modify the key (eg appending "A" or "B") to make it unique. In that case, the original insert is an error but after one or two retries, a computed unique key allows it to be inserted.

While I understand that INSERT is not as fast as COPY, inserting is fast enough and provides the needed flexibility (especially for custom logic to alter duplicate keys).

However, inserting rows is only fast enough if the database is on the same computer as the Java program. For example, 200,000 rows can be inserted locally in under an hour. Over the Internet (using ssh tunneling with Trilead's code) it took over six hours. That's the problem.

I've tinkered some with setting autocommit off and doing commits every 100 inserts, but if there's an error the entire set is lost, not just the offending row.

Also, postgres apparently doesn't allow setting the transaction isolation to NONE. Am I wrong about that? If it did, would that help?

I fail to understand why this is so much slower over the Internet. Even if autocommit is on, that's more work for the server, I shouldn't think that it increases the network traffic. I must be entirely in the dark on this. Does autocommit cause jdbc code on my client computer to send a BEGIN transaction, the insert, and a COMMIT for each row?

John

Responses

Browse pgsql-jdbc by date

  From Date Subject
Next Message Craig Ringer 2008-06-11 16:06:01 Re: Populating large tables with occasional bad values
Previous Message Sanjay Thomas 2008-06-07 08:03:55 Error creating connection using postgres