Re: Parallel COPY FROM execution

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Alex K <kondratov(dot)aleksey(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>, Anastasia Lubennikova <lubennikovaAV(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Parallel COPY FROM execution
Date: 2017-06-30 12:35:46
Message-ID: CAFj8pRCH6X+RwZOg_a232-d76MzVvhfWHBFg-aeuHsNsanmwFg@mail.gmail.com
Lists: pgsql-hackers

2017-06-30 14:23 GMT+02:00 Alex K <kondratov(dot)aleksey(at)gmail(dot)com>:

> Greetings pgsql-hackers,
>
> I am a GSoC student this year; my initial proposal was discussed
> in the following thread:
> https://www.postgresql.org/message-id/flat/7179F2FD-49CE-4093-AE14-1B26C5DFB0DA%40gmail.com
>
> The patch with COPY FROM error handling seems to be nearly finished, so
> I have started thinking about parallelism in COPY FROM, which is the next
> point in my proposal.
>
> To understand whether there are any expensive calls in COPY that can be
> executed in parallel, I did some profiling. First, please find attached a
> flame graph of the most expensive copy.c calls during 'COPY FROM file'
> (copy_from.svg). It reveals that the inevitably serial operations, such as
> CopyReadLine (<15%) and heap_multi_insert (~15%), take less than 50% of
> the total time, while the remaining operations, such as heap_form_tuple
> and the multiple checks inside NextCopyFrom, can probably be executed well
> in parallel.
>
> Second, I compared the execution time of 'COPY FROM a single large
> file (~300 MB, 50000000 lines)' vs. 'COPY FROM four equal parts of the
> original file executed in four parallel processes'. Though it is a
> very rough test, it helps to obtain an overall estimate:
>
> Serial:
> real 0m56.571s
> user 0m0.005s
> sys 0m0.006s
>
> Parallel (x4):
> real 0m22.542s
> user 0m0.015s
> sys 0m0.018s
>
> Thus, the result is a ~60% performance boost for each doubling of the
> number of parallel processes, which is consistent with the initial estimate.
>
>
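For reference, the arithmetic behind the quoted timings can be checked with a short script. This is only a sketch: it uses the two "real" times from the message, and the Amdahl's law step is an assumption added here for illustration (the original message does not name that model). The function names are hypothetical.

```python
import math

# "real" wall-clock times quoted in the message above, in seconds
SERIAL_S = 56.571      # one COPY FROM of the whole file
PARALLEL_S = 22.542    # four concurrent COPY FROM runs, one per file part
WORKERS = 4


def speedup_per_doubling(t_serial, t_parallel, n_workers):
    """Geometric-mean speedup gained by each doubling of the worker count."""
    total = t_serial / t_parallel
    doublings = math.log2(n_workers)
    return total ** (1.0 / doublings)


def implied_serial_fraction(t_serial, t_parallel, n_workers):
    """Serial fraction s that Amdahl's law, S = 1 / (s + (1 - s) / n),
    would need in order to reproduce the observed speedup.
    Assumption: Amdahl's law is used here only as a rough model."""
    inv_speedup = t_parallel / t_serial
    return (inv_speedup - 1.0 / n_workers) / (1.0 - 1.0 / n_workers)


total_speedup = SERIAL_S / PARALLEL_S
per_doubling = speedup_per_doubling(SERIAL_S, PARALLEL_S, WORKERS)
serial_fraction = implied_serial_fraction(SERIAL_S, PARALLEL_S, WORKERS)

print(f"total speedup with {WORKERS} processes: {total_speedup:.2f}x")
print(f"speedup per doubling: {per_doubling:.2f}x")
print(f"implied serial fraction (Amdahl): {serial_fraction:.2f}")
```

Under these assumptions the per-doubling gain comes out a little under 60%, and the implied serial fraction stays below the 50% upper bound read off the flame graph. With a single pair of measurements this is a rough plausibility check, not a benchmark.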
The important use case is a big table with a lot of indexes. Did you test
a similar case?

Regards

Pavel
