Practical error logging for very large COPY statements

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Practical error logging for very large COPY statements
Date: 2005-11-21 22:41:19
Message-ID: 1132612879.4959.474.camel@localhost.localdomain
Lists: pgsql-hackers

If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically.

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.

What I propose is to log uniqueness violations only when the table has
a single unique index.
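
To make this concrete, usage might look something like the sketch
below. The placement of ERRORTABLE in the syntax and the orders /
orders_errors names are assumptions for illustration only, not anything
that exists today:

    -- The error table is assumed to mirror the target table's columns
    -- and to exist before the load starts
    CREATE TABLE orders_errors (LIKE orders);

    -- Rows that would violate the single unique index on orders get
    -- diverted into orders_errors instead of aborting the whole COPY
    COPY orders FROM '/data/orders.dat' ERRORTABLE orders_errors;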

Flow of control would be to:

locate page of index where value should go
lock index block
_bt_check_unique, but don't error
if violation then
    insert row into ERRORTABLE
else
    insert row into data block
    insert row into unique index
unlock index block
do other indexes

Which is very similar code to the recently proposed MERGE logic.
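
As an aside, whether a table qualifies, i.e. has exactly one unique
index, can be checked from the system catalogs. A rough example, again
using the hypothetical orders table:

    -- Count the unique indexes on the target table; the ERRORTABLE
    -- path would only be taken when this returns exactly 1
    SELECT count(*)
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indrelid
     WHERE c.relname = 'orders'
       AND i.indisunique;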

With that logic, a COPY will run to completion, yet still be able to
report the odd couple of unique index violations it found along the
way. More importantly, we can then handle those rows with another
program to locate where those errors came from and resolve them.
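
For example, once the COPY completes, the rejected rows could be
inspected and repaired with ordinary SQL. A rough sketch, again
assuming the hypothetical orders_errors table and an order_id key
column:

    -- See which key values were rejected, and how often
    SELECT order_id, count(*)
      FROM orders_errors
     GROUP BY order_id;

    -- Compare each rejected row against the row that won the race
    SELECT e.*
      FROM orders_errors e
      JOIN orders o ON o.order_id = e.order_id;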

In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both index and data blocks locked across multiple row
inserts until either the unique index or the data block fills. That's
better than taking a full table lock, since it allows concurrent access
to the rest of the table, but it's also more efficient than continually
re-requesting the same blocks (which looks like about a 10-15% saving
on performance from hash lookups, lock/unlock, etc).

Best Regards, Simon Riggs
