Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ants Aasma <ants(at)cybertec(dot)at>, vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-15 11:15:39
Message-ID: CAA4eK1+AjvU-+tzs5Ng2q94b6cw49gZsTPQMisjJ5iPaVEV8yQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed size chunks in parallel, store the chunks and
> the small stats about each chunk in the shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide the
> actual chunk boundaries so that no tuple spans across two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk, parses and
> processes the tuples from it. Once done with one chunk, it picks the
> next one, and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends some significant time in each phase. Gets
> benefit of the instruction cache - at least in phase 1.
> 2. It also has the same advantage of parallel hash join - fast workers
> get to work more.
> 3. We can extend this solution for reading data from STDIN. Of course,
> the phase 1 and phase 2 must be performed by the leader process who
> can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.
>

As I understand this, it needs to parse the lines twice (the second
time in phase 3), and the tuple-processing work done in phase 3 can't
start until the first two phases are over. So even if the tokenization
itself is somewhat faster, we lose time on processing the tuples, which
might not be an overall win; in fact, it could be worse than the
single-reader approach being discussed. If tokenization were a major
(or at least significant) portion of the copy work, such a technique
might be worth pursuing, but the data shared earlier in this thread
shows that is not the case (the tokenize time is small compared to the
tuple-processing time).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
