Re: Parallel copy

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: amit(dot)kapila16(at)gmail(dot)com
Cc: andrew(dot)dunstan(at)2ndquadrant(dot)com, minion(at)decodable(dot)me, thomas(dot)munro(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Parallel copy
Date: 2020-02-18 01:57:03
Message-ID: 20200218.105703.1313615968227299620.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion(at)decodable(dot)me> wrote:
> > So why not just forbid parallel copy in CSV
> > mode, at least for now? I guess it depends on the actual use case. If we
> > expect to be parallel loading humungous CSVs then that won't fly.
> >
>
> I am not sure about this part. However, I guess we should at the very
> least have some extendable solution that can deal with csv, otherwise,
> we might end up re-designing everything if someday we want to deal
> with CSV. One naive idea is that in csv mode, we can set up the
> things slightly differently like the worker, won't start processing
> the chunk unless the previous chunk is completely parsed. So each
> worker would first parse and tokenize the entire chunk and then start
> writing it. So, this will make the reading/parsing part serialized,
> but writes can still be parallel. Now, I don't know if it is a good
> idea to process in a different way for csv mode.

In an extreme case, if we don't see a QUOTE character in a chunk, we
cannot know whether that chunk lies inside a quoted section until all
the preceding chunks have been parsed. In the end, we are forced to
parse fully sequentially as long as we allow QUOTE.
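
To make the ambiguity concrete, here is a minimal Python sketch (not
PostgreSQL code; the function name and the chunk boundaries are made up
for illustration). It assumes the escape character is the same as the
quote character, so a doubled quote toggles the state twice and leaves
it unchanged. The same quote-free chunk yields a different number of
record ends depending on the quote state inherited from earlier chunks.

```python
def count_record_ends(chunk: str, in_quote: bool, quote: str = '"') -> tuple[int, bool]:
    """Count record-terminating newlines in `chunk`.

    `in_quote` is the quote state at the start of the chunk, which is only
    known once every preceding chunk has been scanned.
    Assumption: ESCAPE equals QUOTE, so "" toggles twice and cancels out.
    """
    ends = 0
    for ch in chunk:
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            ends += 1
    return ends, in_quote


# One CSV stream, cut at arbitrary (hypothetical) chunk boundaries.
chunks = ['a,"x\n', 'y\n', 'z"\nb,c\n']

# The middle chunk contains no quote character at all, yet its meaning
# depends entirely on the state handed over from the chunk before it.
as_if_unquoted = count_record_ends(chunks[1], in_quote=False)
as_if_quoted = count_record_ends(chunks[1], in_quote=True)

# Sequential scan: the state is threaded through the chunks in order.
state = False
total = 0
for c in chunks:
    n, state = count_record_ends(c, state)
    total += n
```

Only the sequential scan at the end gives the right answer for the
middle chunk, because it is the only one that knows the inherited state.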

On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
QUOTE '')" as a way to signal that the file contains no quoted
sections, then all chunks could be parsed fully concurrently.
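
A rough Python sketch of why the no-QUOTE case parallelizes (again not
PostgreSQL code; chunk_lines and parse_parallel are hypothetical names,
and the threads only illustrate independence, not performance). Without
quoting, every newline is a record boundary, so each worker can find its
own first record from its chunk offset alone. The convention assumed
here is that a line belongs to the chunk in which it starts.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_lines(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the lines that begin in data[start:end], assuming no quoting."""
    if start > 0:
        # Skip the partial line owned by the previous chunk: the first line
        # start at or after `start` follows the first newline at >= start - 1.
        prev_nl = data.find(b"\n", start - 1)
        if prev_nl == -1:
            return []
        start = prev_nl + 1
    lines = []
    while start < end:
        nl = data.find(b"\n", start)
        if nl == -1:
            nl = len(data)  # last line has no trailing newline
        lines.append(data[start:nl])  # may run past `end`; that is intended
        start = nl + 1
    return lines


def parse_parallel(data: bytes, n_chunks: int) -> list[list[bytes]]:
    """Split `data` into n_chunks byte ranges and tokenize them independently."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    bounds = [(s, min(s + size, len(data))) for s in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        # map() preserves chunk order, so rows come back in file order.
        per_chunk = pool.map(lambda b: chunk_lines(data, *b), bounds)
        return [line.split(b",") for lines in per_chunk for line in lines]


rows = parse_parallel(b"1,a\n22,bb\n333,ccc\n", n_chunks=4)
```

No worker needs any state from its predecessors, which is exactly what
is lost once a QUOTE character is allowed.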

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
