Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-18 10:29:36
Message-ID: CAA4eK1J6_qA=k-WciBvpDGBTHXrT6hKwwAzvfz_hdkMCUQwD9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> > <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion(at)decodable(dot)me> wrot> > So why not just forbid parallel copy in CSV
> > > mode, at least for now? I guess it depends on the actual use case. If we
> > > expect to be parallel loading humungous CSVs then that won't fly.
> > >
> >
> > I am not sure about this part. However, I guess we should at the very
> > least have some extendable solution that can deal with csv, otherwise,
> > we might end up re-designing everything if someday we want to deal
> > with CSV. One naive idea is that in csv mode, we can set up the
> > things slightly differently like the worker, won't start processing
> > the chunk unless the previous chunk is completely parsed. So each
> > worker would first parse and tokenize the entire chunk and then start
> > writing it. So, this will make the reading/parsing part serialized,
> > but writes can still be parallel. Now, I don't know if it is a good
> > idea to process in a different way for csv mode.
>
> In an extreme case, if we didn't see a QUOTE in a chunk, we cannot
> know the chunk is in a quoted section or not, until all the past
> chunks are parsed. After all we are forced to parse fully
> sequentially as far as we allow QUOTE.
>

Right, I think the benefits of this as compared to single reader idea
would be (a) we can save accessing shared memory for the most part of
the chunk (b) for non-csv mode, even the tokenization (finding line
boundaries) would also be parallel. OTOH, doing processing
differently for csv and non-csv mode might not be good.

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
> QUOTE '')" in order to signal that there's no quoted section in the
> file then all chunks would be fully concurrently parsable.
>

Yeah, if we can provide such an option, we can probably make parallel
csv processing equivalent to non-csv. However, users might not like
this as I think in some cases it won't be easier for them to tell
whether the file has quoted fields or not. I am not very sure of this
point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-02-18 10:50:16 Re: [HACKERS] WAL logging problem in 9.4.3?
Previous Message Amit Kapila 2020-02-18 10:20:40 Re: Parallel copy