Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Alexey Kondratov <kondratov(dot)aleksey(at)gmail(dot)com>
Cc: Стас <stas(dot)kelvich(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: GSOC'17 project introduction: Parallel COPY execution with errors handling
Date: 2017-04-06 13:47:46
Message-ID: CAPpHfdvV8FC67Emeb9XJpULkMOtrJiyC0dGL7FMSyRZ2SLk=5Q@mail.gmail.com
Lists: pgsql-hackers

Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov(dot)aleksey(at)gmail(dot)com> wrote:

> Thank you for your responses and valuable comments!
>
> I have written a draft proposal:
> https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
>
> It seems that COPY is currently able to return the first error line and the
> error type (extra or missing columns, type parse error, etc.).
> Thus, an approach similar to what Stas wrote should work and, being
> optimised for a small number of error rows, should not
> affect COPY performance in that case.
>
> I will be glad to receive any critical remarks and suggestions.
>

I have the following questions about your proposal.

> 1. Suppose we have to insert N records
> 2. We create a subtransaction with these N records
> 3. An error is raised on the k-th line
> 4. Then, we can safely insert all lines from the 1st through the (k - 1)-th
> 5. Report, save to an errors table, or silently drop the k-th line
> 6. Next, try to insert lines from the (k + 1)-th through the N-th with another
> subtransaction
> 7. Repeat until the end of the file

Do you assume that we start a new subtransaction in step 4, since the
subtransaction we started in step 2 has been rolled back?
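
To make the question concrete, here is a minimal in-memory sketch of the
quoted steps in Python. It is not PostgreSQL code: insert_batch() only
simulates an all-or-nothing subtransaction, and the names RowError,
insert_batch, copy_with_error_handling and the parse callback are all
hypothetical. It assumes that the failed subtransaction keeps nothing, so the
good prefix has to be inserted again in a fresh one.

```python
class RowError(Exception):
    """Hypothetical error carrying the index of the row that failed."""

    def __init__(self, index, reason):
        super().__init__(reason)
        self.index = index
        self.reason = reason


def insert_batch(table, rows, start, stop, parse):
    """Simulate one subtransaction over rows[start:stop] (all or nothing)."""
    staged = []
    for i in range(start, stop):
        try:
            staged.append(parse(rows[i]))
        except ValueError as exc:
            # 'staged' is discarded here, which plays the role of a rollback.
            raise RowError(i, str(exc))
    table.extend(staged)  # plays the role of a subtransaction commit


def copy_with_error_handling(rows, parse):
    """Insert every parsable row; return (table, list of (index, reason))."""
    table, errors = [], []
    start = 0
    while start < len(rows):
        try:
            # Steps 2 and 6: try all remaining rows in one subtransaction.
            insert_batch(table, rows, start, len(rows), parse)
            break
        except RowError as err:
            # Step 4: the first attempt kept nothing, so rows before the bad
            # one are inserted again in a new subtransaction.
            insert_batch(table, rows, start, err.index, parse)
            # Step 5: report the bad row, then continue after it.
            errors.append((err.index, err.reason))
            start = err.index + 1
    return table, errors


table, errors = copy_with_error_handling(["1", "2", "x", "4", "y", "6"], int)
```

In this sketch every bad row costs one extra pass over the rows that precede
it in the current batch, which is why the approach only pays off when error
rows are rare.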

> I am planning to use background worker processes for parallel COPY
> execution. Each process will receive an equal piece of the input file. Since
> the file is split by size, not by lines, each worker will start its import
> from the first new line so that it does not hit a broken line.

I think the situation where the backend reads the file directly during COPY
is not typical. The more typical case is the psql \copy command. In that case
"COPY ... FROM stdin;" is actually executed while psql is streaming the data.
How can we apply parallel COPY in this case?
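
For reference, the size-based splitting described in the quoted paragraph
could look roughly like the Python sketch below. The function name
chunk_ranges and its parameters are hypothetical, and the sketch assumes a
seekable input opened in binary mode with newline-terminated records that
contain no embedded newlines. The seekable-input assumption is exactly what
does not hold when the data arrives through "COPY ... FROM stdin".

```python
import io


def chunk_ranges(f, n_workers):
    """Return (start, end) byte ranges that each begin on a line boundary."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    # Nominal cut points that split the file into equal-sized pieces.
    cuts = [size * i // n_workers for i in range(n_workers + 1)]
    starts = [0]
    for cut in cuts[1:-1]:
        f.seek(cut)
        f.readline()  # skip to the end of the line the cut landed in
        starts.append(f.tell())
    starts.append(size)
    # Drop empty ranges, which appear when one line spans several cuts.
    return [(a, b) for a, b in zip(starts, starts[1:]) if a < b]


data = b"aaa\nbb\ncccc\nd\n"
ranges = chunk_ranges(io.BytesIO(data), 3)
chunks = [data[a:b] for a, b in ranges]
```

Each worker would then read only its own byte range. The line that a cut
lands in always goes to the preceding worker, so no record is split or read
twice.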

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
