Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

From: Alex K <kondratov(dot)aleksey(at)gmail(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Стас <stas(dot)kelvich(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: GSOC'17 project introduction: Parallel COPY execution with errors handling
Date: 2017-04-10 15:39:16
Message-ID: CADfU8WxKzLun7X0o_HZ75p07JVanvXpkym6YmjrGX1n9CzNz6w@mail.gmail.com
Lists: pgsql-hackers

Hi Alexander!

I missed your reply, since the proposal submission deadline passed last
Monday and I haven't been checking the hackers mailing list very often.

(1) It seems that starting a new subtransaction at step 4 is not necessary.
We can just gather all error lines in one pass and, at the end of input,
start a single additional subtransaction with all the safe lines at once:
[1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is the line
number of the i-th error.
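
A minimal sketch of this one-pass idea in Python, not PostgreSQL code.
parse_row is a hypothetical stand-in for COPY's per-row input checks, and
the returned list of rows stands in for what the single additional
subtransaction would insert. It only models errors that can be detected
while parsing a line; errors raised by the insert itself (a constraint
violation, say) are not covered here.

```python
def parse_row(line, ncols=2):
    """Hypothetical stand-in for COPY's per-row checks (column count, types)."""
    fields = line.split(",")
    if len(fields) != ncols:
        raise ValueError("extra or missing columns")
    return int(fields[0]), fields[1]  # int() raises ValueError on bad input


def split_safe_and_error_lines(lines):
    """One pass over the input: collect parsed rows and error line numbers."""
    safe_rows, error_lines = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            safe_rows.append(parse_row(line))
        except ValueError:
            error_lines.append(lineno)  # this is one of the k_i
    return safe_rows, error_lines


lines = ["1,a", "oops", "3,c", "4,d,extra", "5,e"]
safe_rows, error_lines = split_safe_and_error_lines(lines)
# safe_rows would then go into one subtransaction; error_lines are reported.
```

Here lines 2 and 4 end up in error_lines and the remaining three rows in
safe_rows.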

However, I assume the only viable use case is one where the number of errors
is small relative to the total number of rows: if the input is in a totally
inconsistent format, importing it into the db seems useless. Thus, it is not
100% clear to me whether there would be any real difference in performance
between starting a new subtransaction at step 4 and not doing so.
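
To put rough numbers on that, here is a toy count of subtransactions for
both variants. It rests on an assumption that is not settled in this thread:
that the subtransaction from step 2 is rolled back when an error is raised,
so the lines before the error need a fresh subtransaction. The function
names are made up for illustration.

```python
def subtransactions_with_restart(n_lines, error_lines):
    """Variant with a restart at step 4 (toy model, see assumption above)."""
    count = 0
    start = 1  # first line not yet handled
    for k in sorted(error_lines):
        count += 1          # subtransaction that fails at line k, rolled back
        if k > start:
            count += 1      # fresh subtransaction re-inserting start..k-1
        start = k + 1
    if start <= n_lines:
        count += 1          # final subtransaction for the error-free tail
    return count


def subtransactions_one_pass(n_lines, error_lines):
    """One-pass variant: the initial attempt plus one additional subtransaction."""
    return 2 if error_lines else 1


with_restart = subtransactions_with_restart(5, [2, 4])
one_pass = subtransactions_one_pass(5, [2, 4])
```

Under this model the restart variant grows with the number of errors, while
the one-pass variant stays constant, so the gap is small exactly when errors
are rare.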

(2) Hmm, good question. As far as I know, it is impossible to get the size
of stdin input, so it is impossible to distribute stdin directly to the
parallel workers. The first approach that comes to mind is to store the
stdin input in some kind of buffer/queue and then have the workers read it
in parallel. The question is how this will perform with a large file; I
guess poorly, at least in terms of memory consumption. Whether parallel
execution would still be faster is the next question.
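
One way to keep memory bounded is to buffer only a few chunks of complete
lines at a time rather than the whole input. The sketch below shows that
idea in Python under several assumptions: threads stand in for background
workers, an in-memory stream stands in for stdin, and the "import" just
counts lines. All names and parameters are hypothetical.

```python
import io
import queue
import threading


def read_chunks(stream, max_lines):
    """Yield lists of at most max_lines complete lines from a stream."""
    chunk = []
    for line in stream:
        chunk.append(line)
        if len(chunk) == max_lines:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def worker(tasks, results):
    """Hypothetical worker: 'imports' a chunk by counting its lines."""
    while True:
        chunk = tasks.get()
        if chunk is None:  # sentinel: no more input
            break
        results.put(len(chunk))


def parallel_import(stream, n_workers=2, max_lines=2, max_buffered_chunks=4):
    tasks = queue.Queue(maxsize=max_buffered_chunks)  # bounds memory use
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for chunk in read_chunks(stream, max_lines):
        tasks.put(chunk)  # blocks while the buffer is full
    for _ in threads:
        tasks.put(None)   # one sentinel per worker
    for t in threads:
        t.join()
    total = 0
    while not results.empty():
        total += results.get()
    return total


rows_imported = parallel_import(io.StringIO("1,a\n2,b\n3,c\n4,d\n5,e\n"))
```

The single reader is still a serial step, so this says nothing about whether
the parallel part would be fast enough to pay off.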

Alexey

On Thu, Apr 6, 2017 at 4:47 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> Hi, Alexey!
>
> On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <
> kondratov(dot)aleksey(at)gmail(dot)com> wrote:
>
>> Thank you for your responses and valuable comments!
>>
>> I have written draft proposal
>> https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
>>
>> It seems that COPY is currently able to return the first error line and
>> the error type (extra or missing columns, type parse error, etc.).
>> Thus, an approach similar to the one Stas described should work and,
>> being optimised for a small number of error rows, should not
>> affect COPY performance in that case.
>>
>> I will be glad to receive any critical remarks and suggestions.
>>
>
> I have the following questions about your proposal.
>
>> 1. Suppose we have to insert N records
>> 2. We create a subtransaction with these N records
>> 3. An error is raised on the k-th line
>> 4. Then, we can safely insert all lines from the 1st through the (k - 1)-th
>> 5. Report, save to an errors table or silently drop the k-th line
>> 6. Next, try to insert lines from (k + 1) through N with another
>> subtransaction
>> 7. Repeat until the end of file
>
>
> Do you assume that we start a new subtransaction in step 4 because the
> subtransaction we started in step 2 is rolled back?
>
>> I am planning to use background worker processes for parallel COPY
>> execution. Each process will receive an equal piece of the input file.
>> Since the file is split by size, not by lines, each worker will start its
>> import from the first new line so as not to hit a broken line.
>
>
> I think the situation where the backend reads the file directly during
> COPY is not typical. The more typical case is the psql \copy command. In
> that case "COPY ... FROM stdin;" is actually executed while psql streams
> the data.
> How can we apply parallel COPY in this case?
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
