Re: Populating large tables with occasional bad values

From: "John T(dot) Dow" <john(at)johntdow(dot)com>
To: "Craig Ringer" <craig(at)postnewspapers(dot)com(dot)au>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, "tivvpgsqljdbc(at)gtech-ua(dot)com" <tivvpgsqljdbc(at)gtech-ua(dot)com>
Subject: Re: Populating large tables with occasional bad values
Date: 2008-06-13 21:56:24
Message-ID: 200806132215.m5DMFRrO015727@web2.nidhog.com
Lists: pgsql-jdbc

I have a solution that I am happy with, yielding nearly a 10-fold improvement in speed.

I tried a quick-and-dirty experiment with different batch sizes on a small table of 750 rows.

Without using batches, the insert took 64 seconds. (Auto-commit was true for this and the other tests.)

Batch size 10, 13 seconds.

Batch size 100, 5 seconds.

This code was very simple -- if there was any error in the batch, the entire batch was discarded. That gives an idea of what the performance would be like.

Of course, with a batch size of 10 and two errors, the number of rows loaded is short by 17 because two entire batches were discarded. With a batch size of 100, it's short by 150. (I also neglected to send the last, partial batch, this being just a quick and dirty experiment.)

Anyway, a batch size of 10 gives roughly a 5-fold performance improvement; a batch size of 100, about 10-fold.
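
For anyone curious, the quick-and-dirty test looked roughly like the sketch below. The table, the column names and the List<String[]> row source are placeholders, not the real loader code; any batch that throws is simply dropped, exactly as described above.

import java.sql.*;
import java.util.List;

class QuickBatchTest {
    // Rough shape of the quick-and-dirty test: add rows to a batch, execute every
    // BATCH_SIZE rows, and drop the whole batch if it fails.
    static void load(Connection conn, List<String[]> rows) throws SQLException {
        final int BATCH_SIZE = 100;   // 10 and 100 were the sizes timed above
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO target_table (col_a, col_b) VALUES (?, ?)");
        try {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    try {
                        ps.executeBatch();   // one round trip for the whole batch
                    } catch (BatchUpdateException e) {
                        // quick and dirty: every row in this batch is lost
                        System.err.println("batch discarded: " + e.getMessage());
                    }
                    pending = 0;
                }
            }
            // NOTE: the final partial batch is never sent here -- one of the
            // things to fix in the cleaned-up version.
        } finally {
            ps.close();
        }
    }
}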

Craig wrote:

>IMO the best solution with batches is to use batches to store your
>planned operations in a staging table using INSERTs that cannot fail
>under normal circumstances. Then issue a single PL/PgSQL call or a bulk
>UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
>and catch problems.
>
>The approach you describe will work, but it'll be a bit ugly and not
>very fast. If failures are very rare you might find it to be OK, but
>your suggestion of using only 10 statements per batch suggests that
>failures aren't *that* rare.

I need reasonable performance, not the best that money can buy. My clients have tables with tens of thousands of rows, not hundreds of thousands or millions. Also, I want a general-purpose solution that will work with no special knowledge of the table being loaded -- just the information obtainable from the metadata (such as column names and types). A staging table such as you suggest could be created automatically (copy the columns but change them all to char varying so they load). But then locating the rows with bad values would be a messy bit of SQL, although it could be generated by my code. Still, I'd have to bring over the good rows (casting char varying to integer or whatever) and then remove them, leaving the bad rows behind.
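
Just to spell out what that automatic staging table could look like in JDBC terms, here is a rough sketch (generic names, no error handling): copy the column names from the metadata and declare everything as varchar so the raw values always load.

import java.sql.*;

class StagingTableSketch {
    // Illustration only: create <table>_staging with the same column names as the
    // original table, but every column declared varchar so nothing is rejected.
    static void createStagingTable(Connection conn, String table) throws SQLException {
        DatabaseMetaData md = conn.getMetaData();
        StringBuilder ddl = new StringBuilder("CREATE TABLE " + table + "_staging (");
        ResultSet cols = md.getColumns(null, null, table, "%");
        boolean first = true;
        while (cols.next()) {
            if (!first) {
                ddl.append(", ");
            }
            ddl.append(cols.getString("COLUMN_NAME")).append(" varchar");
            first = false;
        }
        cols.close();
        ddl.append(")");
        Statement st = conn.createStatement();
        try {
            st.executeUpdate(ddl.toString());
        } finally {
            st.close();
        }
    }
}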

I'll clean up my dirty code, e.g., make sure it sends the last partial batch. I'll also have it iterate over any batch that's rejected and insert its rows individually. The simplest approach is to resend the rows in the batch one by one. A more elaborate approach would be to resend a partial batch up to the point of the problem, skip the known problem row, then send another partial batch with the remaining rows, and keep doing that until every row in the original batch has either been inserted successfully or written to the error log.
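
The cleaned-up flush will look something like this sketch (again placeholder names and a two-column row format). Each batch's rows are kept in a List so that, when executeBatch throws, they can be replayed one at a time and only the genuinely bad ones end up in the error log. Since a failed batch loses all of its rows (as the counts above show), every row in it can safely be resent.

import java.sql.*;
import java.util.List;

class RetryOnFailure {
    // Sketch of the simple "resend one by one" fallback. batchRows holds the same
    // values that were added to the PreparedStatement's batch.
    static void flush(PreparedStatement ps, List<String[]> batchRows) throws SQLException {
        try {
            ps.executeBatch();
        } catch (BatchUpdateException e) {
            // e.getUpdateCounts() can show how far the batch got -- that's what the
            // more elaborate partial-resend variant would start from.
            for (String[] row : batchRows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                try {
                    ps.executeUpdate();               // this row was fine after all
                } catch (SQLException rejected) {
                    System.err.println("error log: " + rejected.getMessage());
                }
            }
        }
        batchRows.clear();
    }
}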

John
