Practical error logging for very large COPY statements

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Practical error logging for very large COPY statements
Date: 2005-11-21 22:41:19
Message-ID: 1132612879.4959.474.camel@localhost.localdomain
Lists: pgsql-hackers

If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically.

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.

What I propose is to log uniqueness violations only when the table has
a single unique index.
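
To make this concrete, usage might look something like the sketch
below. The placement of ERRORTABLE in the syntax and the orders /
orders_errors names are assumptions for illustration only, not anything
that exists today:

    -- The error table is assumed to mirror the target table's columns
    -- and to exist before the load starts
    CREATE TABLE orders_errors (LIKE orders);

    -- Rows that would violate the single unique index on orders get
    -- diverted into orders_errors instead of aborting the whole COPY
    COPY orders FROM '/data/orders.dat' ERRORTABLE orders_errors;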

Flow of control would be to:

locate page of index where value should go
lock index block
_bt_check_unique, but don't error
if violation then
    insert row into ERRORTABLE
else
    insert row into data block
    insert row into unique index
unlock index block
do other indexes

Which is very similar code to the recently proposed MERGE logic.
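
As an aside, whether a table qualifies, i.e. has exactly one unique
index, can be checked from the system catalogs. A rough example, again
using the hypothetical orders table:

    -- Count the unique indexes on the target table; the ERRORTABLE
    -- path would only be taken when this returns exactly 1
    SELECT count(*)
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indrelid
     WHERE c.relname = 'orders'
       AND i.indisunique;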

With that logic, a COPY will run to completion, yet still be able to
report the odd couple of unique index violations it found along the
way. More importantly, we can then handle those rows with another
program to locate where those errors came from and resolve them.
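
For example, once the COPY completes, the rejected rows could be
inspected and repaired with ordinary SQL. A rough sketch, again
assuming the hypothetical orders_errors table and an order_id key
column:

    -- See which key values were rejected, and how often
    SELECT order_id, count(*)
      FROM orders_errors
     GROUP BY order_id;

    -- Compare each rejected row against the row that won the race
    SELECT e.*
      FROM orders_errors e
      JOIN orders o ON o.order_id = e.order_id;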

In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both index and data blocks locked across multiple row
inserts until either the unique index or the data block fills. That's
better than taking a full table lock, since it allows concurrent access
to the rest of the table, but it's also more efficient than continually
re-requesting the same blocks (which looks like about a 10-15% saving
on performance from hash lookups, lock/unlock, etc).

Best Regards, Simon Riggs
