Re: Populating large tables with occasional bad values

From: "John T(dot) Dow" <john(at)johntdow(dot)com>
To: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, "tivvpgsqljdbc(at)gtech-ua(dot)com" <tivvpgsqljdbc(at)gtech-ua(dot)com>
Subject: Re: Populating large tables with occasional bad values
Date: 2008-06-13 14:59:54
Message-ID: 200806131518.m5DFIrSD018924@web2.nidhog.com
Lists: pgsql-jdbc

I have tried using a batch and it looked good at first, but I don't think it will work with Postgres as cleanly as you'd like.

First, set autocommit false.

Second, addBatch many times for the insert statements.

Third, executeBatch.

Fourth, executeBatch throws a BatchUpdateException. The exception reports which insert statement had a problem; getNextException() gives more details (duplicate key). Only one problem is reported. getUpdateCounts() on the exception reports what we already know: that the statements up until this point all inserted without problem.

Fifth - now what? I tried to commit at this point, hoping that the first insert statements would be committed, but they are not. Actually, I was hoping that the entire batch would be processed and then I could see which individual statements didn't insert (using the update counts) and then redo them individually, but I think that is wishful thinking.
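
To make that sequence concrete, here is a minimal sketch of what I'm describing. The connection details, table, and column names are placeholders, and the duplicate id is contrived just to force the failure.

import java.sql.*;

public class BatchFailureDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
        conn.setAutoCommit(false);                        // first: autocommit off

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO mytable (id, name) VALUES (?, ?)");
        for (int i = 0; i < 100; i++) {                   // second: addBatch many times
            ps.setInt(1, i == 50 ? 10 : i);               // row 50 reuses id 10 to force a
            ps.setString(2, "row " + i);                  // duplicate-key error (id is the PK)
            ps.addBatch();
        }

        try {
            ps.executeBatch();                            // third: executeBatch
            conn.commit();
        } catch (BatchUpdateException e) {                // fourth: only the first failure
            System.err.println(e.getNextException());     // is reported (duplicate key detail)
            int accepted = e.getUpdateCounts().length;    // statements run before the error
            conn.commit();                                // fifth: does NOT persist those rows;
                                                          // the transaction is rollback-only
            System.err.println(accepted + " statements ran before the error, none committed");
        }
        conn.close();
    }
}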

Best possible solution with batches? Submit the batch, learn how many were accepted before the first error, resubmit just those. Process the problem statement by itself. Repeat with the statements left in the original batch until none are left, then create a new batch and begin anew. If a batch, or portion thereof, has no exception, then of course commit.

For this, since portions of batches will be sent twice, the batch shouldn't be too large. Even 10 statements in a batch would substantially reduce the overhead; there's certainly no need to do thousands.
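
A rough sketch of that retry loop follows, assuming (as I saw above) that the update counts on the exception stop at the first failure. Row, insertRows, mytable, and the bad-row handling are illustrative, not actual code.

import java.sql.*;
import java.util.List;

public class BatchRetry {
    // Placeholder row type for illustration.
    static class Row { int id; String name; }

    static void insertRows(Connection conn, List<Row> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO mytable (id, name) VALUES (?, ?)")) {
            for (Row r : rows) {
                ps.setInt(1, r.id);
                ps.setString(2, r.name);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    static void insertWithRetry(Connection conn, List<Row> batch) throws SQLException {
        int start = 0;
        while (start < batch.size()) {
            List<Row> remaining = batch.subList(start, batch.size());
            try {
                insertRows(conn, remaining);
                conn.commit();                        // no failures: commit and done
                return;
            } catch (BatchUpdateException e) {
                conn.rollback();                      // transaction is rollback-only anyway
                int ok = e.getUpdateCounts().length;  // statements accepted before the error
                insertRows(conn, remaining.subList(0, ok));
                conn.commit();                        // resubmit and commit just those
                // The problem statement would be handled by itself here
                // (logged, corrected, or retried individually).
                System.err.println("Bad row at index " + (start + ok)
                        + ": " + e.getNextException());
                start += ok + 1;                      // continue with what's left
            }
        }
    }
}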

In the posting I quote below, the implication is that you can learn about multiple failing statements in the batch, but I didn't see that happening. It seems that Postgres just quits at the first problem.

John

On Thu, 12 Jun 2008 14:35:30 +0300, tivvpgsqljdbc(at)gtech-ua(dot)com wrote:

>John T. Dow wrote:
>>> How about batches? Should not they help you in this case?
>>>
>>
>>
>> What happens if the 23rd and 59th rows in a batch of 100 have a problem, such as an invalid numeric value or a duplicate key?
>>
>> Can the other 98 rows be committed?
>>
>> I haven't tried this.
>>
>
>In postgresql, no (maybe yes with autocommit turned on - I did not try),
>because postgresql marks the transaction as rollback-only on the first problem.
>The one thing you can do is to detect which records are wrong in the
>batch (say, #2 and #55) and redo the batch without the failing records
>(with or without appending new records to the batch). This would make the
>server load higher, but would give you only two roundtrips instead of 100
>roundtrips. Actually, if every batch has failing records, your
>server will have at most twice as many transactions.
>
>
