Re: Populating large tables with occasional bad values

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: "John T(dot) Dow" <john(at)johntdow(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, "tivvpgsqljdbc(at)gtech-ua(dot)com" <tivvpgsqljdbc(at)gtech-ua(dot)com>
Subject: Re: Populating large tables with occasional bad values
Date: 2008-06-13 19:10:01
Message-ID: 4852C609.3060109@postnewspapers.com.au
Lists: pgsql-jdbc

John T. Dow wrote:

> Fifth - now what? I tried to commit at this point, hoping that the
> first insert statements would be committed, but they are not. Actually, I
> was hoping that the entire batch would be processed and then I could see
> which individual statements didn't insert (using the update counts) and
> then redo them individually, but I think that is wishful thinking.

The first error causes a transaction rollback. The only way the JDBC
driver could skip the failed operation and continue would be to wrap
each statement in a savepoint and, if there's an error, issue a rollback
to the last savepoint. That's not ideal from a performance point of view.
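
For illustration, here's roughly what that looks like done by hand from
JDBC (untested sketch; the connection details and target_table schema
are invented for the example, and the second row is a deliberate
duplicate key):

import java.sql.*;

public class SavepointPerStatement {
    public static void main(String[] args) throws SQLException {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO target_table (id, val) VALUES (?, ?)");
        int[] ids = { 1, 1, 2 };  // middle row violates the unique key
        for (int id : ids) {
            // One savepoint per statement, released on success so they
            // don't pile up server-side.
            Savepoint sp = conn.setSavepoint();
            try {
                ps.setInt(1, id);
                ps.setString(2, "row " + id);
                ps.executeUpdate();
                conn.releaseSavepoint(sp);
            } catch (SQLException e) {
                // Discard only the failed insert; earlier work survives.
                conn.rollback(sp);
                System.err.println("skipped id " + id + ": " + e.getMessage());
            }
        }
        conn.commit();  // everything that didn't fail is kept
        conn.close();
    }
}

Every SAVEPOINT / RELEASE pair is an extra round trip to the server,
which is where the performance cost comes from.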

> Best possible solution with batches? Submit the batch, learn how many
> were accepted before the first error, resubmit just those. Process the
> problem statement by itself. Repeat with the statements left in the
> original batch until none are left, then create a new batch and begin
> anew. If a batch, or portion thereof, has no exception, then of course
> commit.

IMO the best solution with batches is to use batches to store your
planned operations in a staging table, using INSERTs that cannot fail
under normal circumstances. Then issue a single PL/PgSQL call or a bulk
INSERT INTO ... SELECT with an appropriate WHERE clause to apply the
staged rows and catch the problem ones.
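
Roughly like this (untested, names invented; the WHERE NOT EXISTS
stands in for whatever validity check your data actually needs):

import java.sql.*;

public class StagingTableLoad {
    public static void main(String[] args) throws SQLException {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        conn.setAutoCommit(false);
        Statement st = conn.createStatement();
        // No constraints on the staging table, so these inserts can't
        // fail for data reasons.
        st.execute("CREATE TEMP TABLE staging (id integer, val text)");

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO staging (id, val) VALUES (?, ?)");
        for (int i = 0; i < 10000; i++) {
            ps.setInt(1, i);
            ps.setString(2, "row " + i);
            ps.addBatch();
        }
        ps.executeBatch();  // can't trip a constraint, so no rollbacks

        // Report the problem rows ("problem" here just means a
        // duplicate key), then apply the rest in one set-based statement.
        ResultSet rs = st.executeQuery(
                "SELECT s.id FROM staging s "
                + "WHERE EXISTS (SELECT 1 FROM target_table t WHERE t.id = s.id)");
        while (rs.next()) {
            System.err.println("duplicate id, not applied: " + rs.getInt(1));
        }
        st.execute("INSERT INTO target_table (id, val) "
                + "SELECT s.id, s.val FROM staging s "
                + "WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = s.id)");
        conn.commit();
        conn.close();
    }
}

The batched inserts go over the wire at full speed because nothing in
them can fail; all the error handling happens in one server-side
statement at the end.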

The approach you describe will work, but it'll be a bit ugly and not
very fast (see the sketch below). If failures are very rare you might
find it to be OK, but the fact that you're proposing only 10 statements
per batch suggests that failures aren't *that* rare.
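
For what it's worth, a sketch of that loop (untested; it assumes the
driver returns update counts only for the statements that succeeded
before the failure, which matches the stop-at-first-error behaviour
you've been seeing):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class BatchRetryLoop {
    static void insertAll(Connection conn, List<String> vals)
            throws SQLException {
        conn.setAutoCommit(false);
        List<String> pending = new ArrayList<String>(vals);
        while (!pending.isEmpty()) {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO target_table (val) VALUES (?)");
            for (String v : pending) {
                ps.setString(1, v);
                ps.addBatch();
            }
            try {
                ps.executeBatch();
                conn.commit();
                pending.clear();     // all done
            } catch (BatchUpdateException e) {
                conn.rollback();     // everything in this batch is gone
                // The driver stops at the first failure, so the number
                // of update counts is the index of the bad statement.
                int bad = e.getUpdateCounts().length;
                System.err.println("rejected: " + pending.get(bad)
                        + " (" + e.getMessage() + ")");
                pending.remove(bad); // drop it and resubmit the rest
            } finally {
                ps.close();
            }
        }
    }
}

Each failure costs you a rollback plus a full resubmit of every row
that hasn't been committed yet, which is why it gets slow as soon as
failures stop being rare.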

> In the posting I quote below, the implication is that you can learn of
> multiple statements in the batch that failed, but I didn't see that
> happening.

AFAIK that's database specific. The docs I read suggested that DBs are
permitted to stop processing and return info on the first bad row, or to
continue processing and return info on all rows.

As PostgreSQL cannot just keep going after an error within a transaction
unless you use savepoints, it chooses the former.

It might be cool if Pg supported an automatic savepoint mode where every
statement implicitly updated a savepoint. If a statement failed you could
ROLLBACK TO LAST STATEMENT. Ideally the savepoint would be replaced with
each statement, so there'd be no ever-growing set of savepoints to worry
about. With this capability you could use proper transactional isolation
but also achieve some error handling within the transaction. Personally,
though, I either restructure my SQL to avoid the potential errors or use
PL/PgSQL to handle them. The app has to be capable of reissuing failed
transactions anyway, so sometimes an app-level transaction do-over is
quite enough. There might be situations in which that hypothetical
feature could be handy, though.
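
The PL/PgSQL route is along these lines (untested; function and table
names invented). Each EXCEPTION block runs in its own implicit
subtransaction, so a failed insert doesn't poison the enclosing
transaction:

import java.sql.*;

public class PlpgsqlErrorHandling {
    public static void main(String[] args) throws SQLException {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        Statement st = conn.createStatement();
        st.execute(
            "CREATE OR REPLACE FUNCTION try_insert(p_id integer, p_val text) "
            + "RETURNS boolean AS $$ "
            + "BEGIN "
            + "  INSERT INTO target_table (id, val) VALUES (p_id, p_val); "
            + "  RETURN true; "
            + "EXCEPTION WHEN unique_violation THEN "
            + "  RETURN false; "  // swallow the dup, report it to the caller
            + "END; $$ LANGUAGE plpgsql");

        PreparedStatement ps = conn.prepareStatement("SELECT try_insert(?, ?)");
        ps.setInt(1, 42);
        ps.setString(2, "hello");
        ResultSet rs = ps.executeQuery();
        rs.next();
        if (!rs.getBoolean(1)) {
            System.err.println("id 42 was a duplicate, skipped");
        }
        conn.close();
    }
}

You still pay a function call per row this way, but a bad row just
returns false instead of aborting everything around it.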

> It seems that postgres just quits at the first problem.

Yes, by design. If a statement in a transaction fails then, without a
savepoint in place, there is AFAIK no way to just pretend the bad
statement was never issued.

I don't know if it'll keep going if you enable autocommit. It might; it
depends on how the JDBC driver does batches and how the server processes
them. Your performance will suffer with autocommit on - though not as
badly as it does from the high round-trip latencies you're already
paying - and you won't be able to roll back if something nasty happens.
If you don't care about things like automatic rollback on network
dropout, you should probably test batches with autocommit on and see
how they behave.
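
If you do try it, something as dumb as this will tell you (untested;
the middle row is the deliberate failure):

import java.sql.*;

public class AutocommitBatchTest {
    public static void main(String[] args) throws SQLException {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "password");
        conn.setAutoCommit(true);
        Statement st = conn.createStatement();
        st.addBatch("INSERT INTO target_table (id, val) VALUES (1, 'a')");
        st.addBatch("INSERT INTO target_table (id, val) VALUES (1, 'dup')");
        st.addBatch("INSERT INTO target_table (id, val) VALUES (2, 'b')");
        try {
            st.executeBatch();
        } catch (BatchUpdateException e) {
            // One update count means the driver stopped at the bad row;
            // three (with an EXECUTE_FAILED entry) means it kept going.
            System.out.println(e.getUpdateCounts().length + " update counts");
        }
        conn.close();
    }
}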

--
Craig Ringer
