Re: Bulkloading using COPY - ignore duplicates?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Zeugswetter Andreas SB SD" <ZeugswetterA(at)spardat(dot)at>, "Lee Kindness" <lkindness(at)csl(dot)co(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-10-01 17:21:41
Message-ID: 24979.1001956901@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I said:
> "Zeugswetter Andreas SB SD" <ZeugswetterA(at)spardat(dot)at> writes:
>> I thought that the problem was, that you cannot simply skip the
>> insert, because at that time the tuple (pointer) might have already
>> been successfully inserted into an other index/heap, and thus this was
>> only sanely possible with savepoints/undo.

> Hmm, good point. If we don't error out the transaction then that tuple
> would become good when we commit. This is nastier than it appears.

On further thought, I think it *would* be possible to do this without
savepoints, but it'd take some restructuring of the index AM API.
What'd have to happen is that a unique index could not raise an elog
ERROR when it detects a uniqueness conflict. Instead, it'd return a
uniqueness-conflict indication back to its caller. This would have
to propagate up to the level of the executor. At that point we'd make
the choice of whether to raise an error or not. If not, we'd need to
modify the just-created tuple to mark it deleted by the current
transaction. We can't remove it, since that would leave any
already-created entries in other indexes pointing to nothing. But
marking it deleted by the same xact and command ID that inserted it
would leave things in a valid state until VACUUM comes along to do the
janitorial work.

To support backoff in the case of a conflict during UPDATE, it'd also be
necessary to un-mark the prior version of the tuple, which we'd already
marked as deleted. This might create some concurrency issues in case
there are other updaters waiting to see if we commit or not. (The same
issue would arise for savepoint-based undo, though.) We might want to
punt on that part for now.

The effects don't stop propagating there, either. The decision not to
insert the tuple must be reported up still further, so that the executor
knows not to run any AFTER INSERT/UPDATE triggers and knows not to count
the tuple as inserted/updated for the command completion report.

In short, quite a lot of code to touch to make this happen ...

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2001-10-01 17:31:35 Re: developer's faq
Previous Message Vince Vielhaber 2001-10-01 16:59:31 Re: developer's faq