Re: COPY issue(gsoc project)

From: longlong <asfnuts(at)gmail(dot)com>
To: NikhilS <nikkhils(at)gmail(dot)com>
Cc: "Neil Conway" <neilc(at)samurai(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-20 06:00:31
Message-ID: d9f0a46b0803192300r2d1b9759sa6bf8670625f8265@mail.gmail.com
Lists: pgsql-hackers

hi NikhilS.

2008/3/14, NikhilS <nikkhils(at)gmail(dot)com>:
>
> Hi Longlong,
>
>
> > > i think this is a better idea.
> > from *NikhilS *
> > http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
> > But instead of using a per insert or a batch insert subtransaction, I am
> > thinking that we can start off a subtransaction and continue it till we
> > encounter a failure. The moment an error is encountered, since we have the
> > offending (already in heap) tuple around, we can call a simple_heap_delete
> > on the same and commit (instead of aborting) this subtransaction after doing
> > some minor cleanup. This current input data row can also be logged into a
> > bad file. Recall that we need to only handle those errors in which the
> > simple_heap_insert is successful, but the index insertion or the after row
> > insert trigger causes an error. The rest of the load then can go ahead with
> > the start of a new subtransaction.
> > the simplest things are often the best.
> > i think it's hard to implement, or has some other deficiency, since you
> > want a subtransaction for every "n" rows.
> >
>
>
> Yeah simpler things are often the best, but as folks are mentioning, we
> need a carefully thought out approach here. The reply from Tom to my posting
> there raises issues which need to be taken care of. Although I still think
> that if we carry out *sanity* checks before starting the load for the
> presence of triggers, constraints, fkey constraints etc., and if others do
> not have any issues with the approach, the simple_heap_delete idea should
> work in some cases. Although the term I used "after some minor cleanup"
> might need some thought too, now that I think more about it..
>
> Also if Fkey checks or complex triggers are around, maybe we can fall back
> to a subtransaction per row insert too as a worst-case measure..
>
> Regards,
> Nikhils
>
> --
> EnterpriseDB http://www.enterprisedb.com
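
(to make sure i understand the quoted idea, here is a rough sketch of the flow
it describes -- this is not real copy.c code. InsertIndexesAndTriggers() and
LogBadRow() are placeholder names i made up for the index-insert /
after-row-trigger work and the bad-file logging; the heap, subtransaction and
error-handling calls are existing backend functions.)

#include "postgres.h"
#include "access/heapam.h"
#include "access/xact.h"

/* hypothetical placeholders, not real backend functions */
extern void InsertIndexesAndTriggers(Relation rel, HeapTuple tuple);
extern void LogBadRow(HeapTuple tuple);

/*
 * One row of the load.  The surrounding loop is assumed to have started a
 * subtransaction with BeginInternalSubTransaction("copy_chunk").  Errors
 * inside simple_heap_insert() itself are not handled here -- the quoted idea
 * only covers failures after the heap insert has succeeded.
 */
static void
CopyOneRowSketch(Relation rel, HeapTuple tuple)
{
    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        simple_heap_insert(rel, tuple);         /* sets tuple->t_self */
        InsertIndexesAndTriggers(rel, tuple);   /* may fail, e.g. unique violation */
    }
    PG_CATCH();
    {
        MemoryContextSwitchTo(oldcontext);
        FlushErrorState();

        /* undo the offending heap insert, then commit instead of aborting */
        simple_heap_delete(rel, &tuple->t_self);
        LogBadRow(tuple);                       /* input line goes to the bad file */
        ReleaseCurrentSubTransaction();

        /* the rest of the load goes ahead under a new subtransaction */
        BeginInternalSubTransaction("copy_chunk");
    }
    PG_END_TRY();
}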

i had read every email in that thread before. i didn't think triggers,
constraints and fkey constraints would be a problem, because at that time i
didn't know much about when and how these triggers were fired. :P
now i believe it is all about efficiency (error handling is of course a base
requirement). so i think a simple implementation should be done first; it can
serve as a benchmark against which more advanced and efficient
implementations are measured. this will help to find a final solution.

subtransactions seem to be the only easy and feasible way to me,
as described by Neil earlier (a rough sketch of this loop follows his points):

* in order to be sure that you can recover from an error, you
need to abort the current subtransaction

* starting and committing a subtransaction for every row of the COPY
would be too expensive

* therefore, start and commit a subtransaction for every "n" rows
of input. If an error occurs, you lose at most "n-1" rows of
valid input, which you need to back up and reinsert. There are
various approaches to choosing "n" (statically, based on the
error rate of previous batches in the same load, etc.).
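
(again just a sketch of what this loop might look like inside the backend --
BATCH_SIZE, CopyReadNextRow(), InsertOneRow() and LogBadRowAndRewind() are
placeholder names i made up; the subtransaction and error-handling calls are
the real ones from access/xact.h and elog.h.)

#include "postgres.h"
#include "access/xact.h"

/* hypothetical placeholders for input handling and per-row work */
extern bool CopyReadNextRow(void);      /* returns false at end of input */
extern void InsertOneRow(void);         /* heap insert + index inserts + triggers */
extern void LogBadRowAndRewind(void);   /* bad file + back up to start of batch */

#define BATCH_SIZE 1000                 /* "n": static here, could adapt to error rate */

static void
CopyFromInBatches(void)
{
    volatile bool more = true;

    while (more)
    {
        MemoryContext oldcontext = CurrentMemoryContext;
        int           nrows = 0;

        BeginInternalSubTransaction("copy_batch");
        PG_TRY();
        {
            while (nrows < BATCH_SIZE && (more = CopyReadNextRow()))
            {
                InsertOneRow();
                nrows++;
            }
            ReleaseCurrentSubTransaction();     /* commit this batch of rows */
        }
        PG_CATCH();
        {
            MemoryContextSwitchTo(oldcontext);
            FlushErrorState();
            RollbackAndReleaseCurrentSubTransaction();  /* lose up to n-1 good rows */
            MemoryContextSwitchTo(oldcontext);

            /* log the offending line and back up so the good rows are redone */
            LogBadRowAndRewind();
        }
        PG_END_TRY();
    }
}

choosing "n" is the interesting part: a bigger "n" means fewer subtransactions
but more good rows to redo when an error occurs.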

i want to implement this feature as a google summer of code project. error
handling is the primary target. test cases will be generated for efficiency
analysis.

any suggestion is welcome.
