Re: COPY issue(gsoc project)

From: longlong <asfnuts(at)gmail(dot)com>
To: "Neil Conway" <neilc(at)samurai(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-14 02:45:34
Message-ID: d9f0a46b0803131945x691c9626ma673628340b2e012@mail.gmail.com
Lists: pgsql-hackers

2008/3/12, Neil Conway <neilc(at)samurai(dot)com>:
>
> I don't see why creating index entries in bulk has anything to do with
> COPY vs. INSERT: if a lot of rows are being loaded into the table in a
> single command, it would be a win to create the index entries in bulk,
> regardless of whether COPY or INSERT ... SELECT is being used.
>
> In any case, the "create indexes in bulk" hasn't actually been
> implemented in mainline Postgres...
>
> I mentioned pgloader just as an example of an existing implementation of
> the "error recovery in COPY" idea. The issues with doing an
> implementation of error recovery in the backend that I see are:
>
> * in order to be sure that you can recover from an error, you
> need to abort the current subtransaction
>
> * starting and committing a subtransaction for every row of the COPY
> would be too expensive
>
> * therefore, start and commit a subtransaction for every "n" rows
> of input. If an error occurs, you lose at most "n-1" rows of
> valid input, which you need to backup and reinsert. There are
> various approaches to choosing "n" (statically, based on the
> error rate of previous batches in the same load, etc.).
>
>
> -Neil
>
>
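Just to make sure I am counting the same things you are, here is a toy
simulation of the batch approach above (plain Python, only for counting;
the names are mine, and it assumes the good rows lost with an aborted
batch are re-inserted one per subtransaction, which you did not actually
specify):

    # Toy model of "commit a subtransaction every n rows": each batch gets
    # its own subtransaction; if any row in the batch fails, the batch is
    # rolled back, the good rows are re-inserted one per subtransaction,
    # and the bad rows are rejected to a "bad file".
    import random

    def flush_batch(batch, is_bad, stats):
        subxacts, loaded, rejected = stats
        subxacts += 1                        # the batch subtransaction
        if any(is_bad(r) for r in batch):
            for r in batch:                  # batch aborted: redo row by row
                if is_bad(r):
                    rejected += 1            # goes to the bad file
                else:
                    subxacts += 1            # one retry subtransaction per row
                    loaded += 1
        else:
            loaded += len(batch)
        return subxacts, loaded, rejected

    def copy_with_batches(rows, n, is_bad):
        stats = (0, 0, 0)                    # (subxacts, loaded, rejected)
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == n:
                stats = flush_batch(batch, is_bad, stats)
                batch = []
        if batch:
            stats = flush_batch(batch, is_bad, stats)
        return stats

    if __name__ == "__main__":
        random.seed(1)
        m, p = 100000, 0.001
        bad = set(r for r in range(m) if random.random() < p)
        for n in (10, 100, 1000, 10000):
            print(n, copy_with_batches(range(m), n, bad.__contains__))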
That said, I think the following idea, from NikhilS, is a better one:
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
But instead of using a per insert or a batch insert subtransaction, I am
thinking that we can start off a subtransaction and continue it till we
encounter a failure. The moment an error is encountered, since we have the
offending (already in heap) tuple around, we can call a simple_heap_delete
on the same and commit (instead of aborting) this subtransaction after doing
some minor cleanup. This current input data row can also be logged into a
bad file. Recall that we need to only handle those errors in which the
simple_heap_insert is successful, but the index insertion or the after row
insert trigger causes an error. The rest of the load then can go ahead with
the start of a new subtransaction.
The simplest things are often the best. I suspect it must be hard to
implement, or have some other deficiency, since you still want a
subtransaction for every "n" rows instead.
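To show why I like it, here is the same kind of toy counting for
NikhilS's scheme (again just illustrative Python; it assumes the bad row
can always be cleaned up, with simple_heap_delete in the real thing, and
the subtransaction then committed):

    # One subtransaction runs until a row fails; the offending tuple is
    # removed, the row is logged to the bad file, the subtransaction is
    # committed, and a fresh one is started for the rest of the load.
    def copy_run_until_error(rows, is_bad):
        subxacts = 1                 # the initial subtransaction
        loaded = rejected = 0
        for row in rows:
            if is_bad(row):
                rejected += 1        # log the bad row
                subxacts += 1        # commit current subxact, start a new one
            else:
                loaded += 1
        return subxacts, loaded, rejected

In this model the number of subtransactions is simply the number of bad
rows plus one, and there is no 'n' to tune at all.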

You mentioned in the "Re: VLDB Features" thread that 'n' could be
adjusted according to when and where errors happen. This is like the
congestion-control mechanisms in TCP. But you can't ignore the time
wasted inside a subtransaction before it hits an error, especially when
'n' is big.
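I imagine that adjustment would look something like TCP's
additive-increase/multiplicative-decrease; a purely hypothetical sketch
(the function and the constants are made up):

    # Grow n slowly while batches commit cleanly, cut it hard on an error.
    def next_batch_size(n, batch_failed, n_min=10, n_max=10000):
        if batch_failed:
            return max(n_min, n // 2)    # multiplicative decrease
        return min(n_max, n + n_min)     # additive increase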

I don't know exactly what a subtransaction (begin and commit) or a line
read (CopyReadLine()) costs, so I just count the number of
subtransactions:

    f(n) = (1 - (1-p)^n) * m + m / n

where m is the number of input lines, n is the number of lines per
subtransaction, and p is the probability that any given row causes an
error. The second term is the m/n batch subtransactions themselves; the
first term assumes that every batch which hits an error (probability
1 - (1-p)^n) has to have its rows redone one subtransaction at a time.
A big 'n' reduces the number of batches (m/n), but it also increases the
chance that each batch fails. Unless p is extremely small, choosing a
big 'n' is a big mistake. In practice errors tend to come in clusters
(in my experience), so the real situation may be a little better.
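To put numbers on the "big mistake" part, a quick check of f(n) (my own
back-of-the-envelope Python, nothing rigorous):

    # f(n) = (1 - (1-p)**n)*m + m/n: rough subtransaction count when every
    # failed batch of n rows has to be redone one row at a time.
    def f(n, m, p):
        return (1 - (1 - p) ** n) * m + m / n

    m, p = 1000000, 0.001
    for n in (10, 30, 100, 1000, 10000):
        print(n, int(f(n, m, p)))

With p = 0.001 the count is smallest around n = 1/sqrt(p), roughly 30,
and by n = 1000 it is already about ten times worse.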

In any case, I think the idea from NikhilS that I started with is the
ideal solution. Yes, I have read the "Re: VLDB Features" thread in the
archives, and I noticed some disagreement about the commit issues and so
on, but I don't think that will be a problem, since so many similar
problems have already been solved in PostgreSQL.
