Re: Parallel copy

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>,Ants Aasma <ants(at)cybertec(dot)at>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>,vignesh C <vignesh21(at)gmail(dot)com>,Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>,Alastair Turner <minion(at)decodable(dot)me>,Thomas Munro <thomas(dot)munro(at)gmail(dot)com>,PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-09 18:55:47
Message-ID: 78C0107E-62F2-4F76-BFD8-34C73B716944@anarazel.de
Lists: pgsql-hackers

Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants(at)cybertec(dot)at> wrote:
>>
>> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> > - The portion of the time that is used to split the lines is not
>> > easily parallelizable. That seems to be a fairly small percentage for
>> > a reasonably wide table, but it looks significant (13-18%) for a
>> > narrow table. Such cases will gain less performance and be limited to
>> > a smaller number of workers. I think we also need to be careful about
>> > files whose lines are longer than the size of the buffer. If we're
>> > not careful, we could get a significant performance drop-off in such
>> > cases. We should make sure to pick an algorithm that seems like it
>> > will handle such cases without serious regressions and check that a
>> > file composed entirely of such long lines is handled reasonably
>> > efficiently.
>>
>> I don't have a proof, but my gut feel tells me that it's fundamentally
>> impossible to ingest csv without a serial line-ending/comment
>> tokenization pass.

I can't quite see a way either. But even if it were possible, I have a hard time seeing parallelizing that path as the right thing to do.
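
To make the problem concrete, here's a rough sketch (not the actual COPY code, and the function name is made up) of the serial pass: whether a newline terminates a record depends on quote state carried along from the start of the input, so a worker dropped at an arbitrary byte offset can't classify a newline locally.

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: serial quote-state scan over a CSV buffer.  Only an
 * unquoted newline terminates a record, and the quote state depends on
 * everything read so far, which is what makes the pass serial.
 */
static size_t
count_record_boundaries(const char *buf, size_t len, char quote)
{
    bool        in_quotes = false;
    size_t      nrecords = 0;

    for (size_t i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (c == quote)
        {
            /* a doubled quote inside a quoted field is an escape */
            if (in_quotes && i + 1 < len && buf[i + 1] == quote)
                i++;
            else
                in_quotes = !in_quotes;
        }
        else if (c == '\n' && !in_quotes)
            nrecords++;
    }
    return nrecords;
}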

>I think even if we try to do it via multiple workers it might not be
>better. In such a scheme, every worker needs to update the end
>boundaries, and the next worker has to keep checking whether the
>previous one has updated the end pointer. I think this can add
>significant synchronization effort for cases where tuples are 100 or
>so bytes, which will be a common case.

It seems like it'd also have terrible caching and instruction level parallelism behavior. Constantly switching which process analyzes the boundaries means the current data has to be brought into L1/registers again, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.
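Roughly what I have in mind, as a sketch only (chunk_queue_push() and next_record_boundary() are hypothetical stand-ins, not existing PostgreSQL functions): a leader hands out large chunks rounded up to a record boundary, so a worker always owns whole lines and naturally clustered data stays together.

#include <stddef.h>

#define CHUNK_SIZE (256 * 1024)

/* hypothetical helpers, named only for illustration */
extern void chunk_queue_push(void *queue, size_t start, size_t end);
extern size_t next_record_boundary(const char *buf, size_t len, size_t pos);

/*
 * Sketch: split the input into ~256kB chunks, each extended to the next
 * record boundary, rather than distributing individual tuples.
 */
static void
distribute_chunks(const char *buf, size_t len, void *queue)
{
    size_t      start = 0;

    while (start < len)
    {
        size_t      end = start + CHUNK_SIZE;

        if (end >= len)
            end = len;
        else
            end = next_record_boundary(buf, len, end);

        chunk_queue_push(queue, start, end);
        start = end;
    }
}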

>> The current line splitting algorithm is terrible.
>> I'm currently working with some scientific data where on ingestion
>> CopyReadLineText() is about 25% on profiles. I prototyped a
>> replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire copy parsing code. It's terrible.
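As an illustration of the kind of fast path that's possible for the simple non-quoted case (this is not Ants' prototype, just a sketch of the general approach): let memchr(), which libc typically vectorizes, find the newlines instead of inspecting every byte in a hand-rolled loop like CopyReadLineText() does today.

#include <stddef.h>
#include <string.h>

/*
 * Sketch: memchr()-based line splitting for input without quoting or
 * escaping.  Invokes callback(line, linelen) for each complete line.
 */
static void
split_lines_fast(const char *buf, size_t len,
                 void (*callback)(const char *line, size_t linelen))
{
    const char *p = buf;
    const char *end = buf + len;

    while (p < end)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            break;              /* partial trailing line, caller keeps it */
        callback(p, nl - p);
        p = nl + 1;
    }
}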

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
