Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-17 11:19:22
Message-ID: CAA4eK1LyAyPCtBk4rkwomeT6=yTse5qWws-7i9EFwnUFZhvu5w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
<andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> On 2/15/20 7:32 AM, Amit Kapila wrote:
> > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion(at)decodable(dot)me> wrote:
> >>>
> >> The problem case that I see is the chunk boundary falling in the
> >> middle of a quoted field where
> >> - The quote opens in chunk 1
> >> - The quote closes in chunk 2
> >> - There is an EoL character between the start of chunk 2 and the closing quote
> >>
> >> When the worker processing chunk 2 starts, it believes itself to be in
> >> out-of-quote state, so only data between the start of the chunk and
> >> the EoL is regarded as belonging to the partial line. From that point
> >> on the parsing of the rest of the chunk goes off track.
> >>
> >> Some of the resulting errors can be avoided by, for instance,
> >> requiring a quote to be preceded by a delimiter or EoL. That answer
> >> fails when fields end with EoL characters, which happens often enough
> >> in the wild.
> >>
> >> Recovering from an incorrect in-quote/out-of-quote state assumption at
> >> the start of parsing a chunk just seems like a hole with no bottom. So
> >> it looks to me like it's best done in a single process which can keep
> >> track of that state reliably.
> >>
> > Good point and I agree with you that having a single process would
> > avoid any such stuff. However, I will think some more on it and if
> > you/anyone else gets some idea on how to deal with this in a
> > multi-worker system (where we can allow each worker to read and
> > process the chunk) then feel free to share your thoughts.
> >
>
>
> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines).
>

AFAIU, that is correct.
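For illustration (a made-up fragment, not taken from anywhere in this
thread), the case in question is a quoted field with an embedded
newline that straddles a chunk boundary:

    id,notes
    1,"note line one
    note line two"      <- suppose the chunk boundary falls inside this field
    2,"ordinary value"

A worker that starts scanning at that boundary has no local way to know
whether it is inside or outside the quotes, so it can wrongly treat the
embedded newline as a record terminator.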

> So why not just forbid parallel copy in CSV
> mode, at least for now? I guess it depends on the actual use case. If we
> expect to be parallel loading humungous CSVs then that won't fly.
>

I am not sure about this part. However, I guess we should at the very
least have an extendable solution that can deal with CSV, otherwise we
might end up re-designing everything if we want to support it someday.
One naive idea is that in CSV mode we set things up slightly
differently: a worker won't start processing its chunk until the
previous chunk has been completely parsed. Each worker would first
parse and tokenize its entire chunk and only then start writing it.
This makes the reading/parsing part serialized, but the writes can
still happen in parallel. Now, I don't know if it is a good idea to
process CSV mode in a different way.
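
To make that concrete, here is a very rough sketch of the per-worker
logic I have in mind for CSV mode (all of the structure and function
names below are invented for illustration; nothing like this exists in
any patch):

#include "postgres.h"

/*
 * Hypothetical sketch: serialize the parsing of chunks so that each
 * worker knows its starting quote state, while letting the writes
 * proceed in parallel.
 */
typedef struct ChunkState
{
    int     chunk_no;       /* position of the chunk in the input file */
    bool    parse_done;     /* has tokenizing of this chunk finished? */
    /* the tokenized tuples for the chunk would live in shared memory */
} ChunkState;

/* Invented placeholders, just to make the sketch read end-to-end. */
extern void wait_for_parse_done(ChunkState *chunk);
extern void signal_parse_done(ChunkState *chunk);
extern void parse_and_tokenize_chunk(ChunkState *chunk);
extern void write_chunk_tuples(ChunkState *chunk);

static void
csv_copy_worker(ChunkState *chunks, int my_chunk)
{
    /*
     * Parsing is serialized: wait until the previous chunk is fully
     * parsed, so the in-quote/out-of-quote state at our start is known.
     */
    if (my_chunk > 0)
        wait_for_parse_done(&chunks[my_chunk - 1]);

    parse_and_tokenize_chunk(&chunks[my_chunk]);
    chunks[my_chunk].parse_done = true;
    signal_parse_done(&chunks[my_chunk]);   /* wake the next worker */

    /*
     * Writing the already-tokenized tuples into the table can overlap
     * freely with the other workers.
     */
    write_chunk_tuples(&chunks[my_chunk]);
}

The wait at the top is what keeps the reading/parsing part serialized;
everything after it can run in parallel across workers.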

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
