Re: Practical error logging for very large COPY statements

From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Practical error logging for very large COPY statements
Date: 2005-11-22 02:00:41
Message-ID: 43827BC9.7050105@familyhealth.com.au
Lists: pgsql-hackers

Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
>
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
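
Today the usual way to hunt that duplicate down by hand is something like
the following -- just a sketch, assuming a hypothetical staging table
"rawload" (loaded without the unique index) with a key column "id":

    -- list key values that occur more than once
    SELECT id, count(*) AS copies
    FROM rawload
    GROUP BY id
    HAVING count(*) > 1;

which is an extra full pass over 100 million rows, with the cleanup still
left to do.
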
>
> I'd like to find a way to handle this manual task programmatically.
>
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
>
> What I propose is to log uniqueness violations only when the table has a
> single unique index.
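
Presumably the user-visible syntax would be something along these lines --
purely illustrative, since the clause name and placement are exactly what
the proposal would have to settle ("bigtable" and "bigtable_errors" are
made-up names):

    COPY bigtable FROM '/tmp/load.dat'
        ERRORTABLE bigtable_errors;
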
>
> Flow of control would be to:
>
>     locate page of index where value should go
>     lock index block
>     _bt_check_unique, but don't error
>     if violation then
>         insert row into ERRORTABLE
>     else
>         insert row into data block
>         insert row into unique index
>     unlock index block
>     do other indexes
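
The user-visible effect is roughly what you can emulate today, row by row,
with a PL/pgSQL exception handler -- a sketch with hypothetical table and
column names, and of course far slower than doing it inside COPY itself:

    -- trap the duplicate instead of aborting, and divert the row
    CREATE FUNCTION load_row(v_id integer, v_payload text)
    RETURNS void AS $$
    BEGIN
        INSERT INTO bigtable (id, payload) VALUES (v_id, v_payload);
    EXCEPTION WHEN unique_violation THEN
        INSERT INTO bigtable_errors (id, payload) VALUES (v_id, v_payload);
    END;
    $$ LANGUAGE plpgsql;
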
>
> This is very similar to the recently proposed MERGE logic.
>
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations it found along the way. More
> importantly, we can then handle those rows with another program to
> locate where those errors came from and resolve them.
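
After the load completes, the mopping up could then be plain SQL against
the error table -- again a sketch, assuming the error table simply gets
the same columns as the target and that the key column is "id":

    -- see which rows were rejected as duplicates
    SELECT * FROM bigtable_errors;

    -- line them up against the versions that did get loaded
    SELECT e.*
    FROM bigtable_errors e
    JOIN bigtable t ON t.id = e.id;
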
>
>
>
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. That's
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but it's also more efficient than continually
> re-requesting the same blocks (which looks like about 10-15% saving on
> performance from hash lookups, lock/unlock, etc).
>
> Best Regards, Simon Riggs
