Quick Links

Re: Parallel copy

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Alastair Turner <minion(at)decodable(dot)me>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Parallel copy
Date:	2020-02-15 12:32:06
Message-ID:	CAA4eK1LYedOF61yXBWFNW3rGoF3Sdt=WuBt182rH=6PKNrWLfA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion(at)decodable(dot)me> wrote:
>
> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion(at)decodable(dot)me> wrote:
> > >
> ...
> > >
> > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
> > >
> >
> > AFAIU, the whole of this in-quote/out-of-quote state is manged inside
> > CopyReadLineText which will be done by each of the parallel workers,
> > something on the lines of what Thomas did in his patch [1].
> > Basically, we need to invent a mechanism to allocate chunks to
> > individual workers and then the whole processing will be done as we
> > are doing now except for special handling for partial tuples which I
> > have explained in my previous email. Am, I missing something here?
> >
> The problem case that I see is the chunk boundary falling in the
> middle of a quoted field where
> - The quote opens in chunk 1
> - The quote closes in chunk 2
> - There is an EoL character between the start of chunk 2 and the closing quote
>
> When the worker processing chunk 2 starts, it believes itself to be in
> out-of-quote state, so only data between the start of the chunk and
> the EoL is regarded as belonging to the partial line. From that point
> on the parsing of the rest of the chunk goes off track.
>
> Some of the resulting errors can be avoided by, for instance,
> requiring a quote to be preceded by a delimiter or EoL. That answer
> fails when fields end with EoL characters, which happens often enough
> in the wild.
>
> Recovering from an incorrect in-quote/out-of-quote state assumption at
> the start of parsing a chunk just seems like a hole with no bottom. So
> it looks to me like it's best done in a single process which can keep
> track of that state reliably.
>

Good point and I agree with you that having a single process would
avoid any such stuff. However, I will think some more on it and if
you/anyone else gets some idea on how to deal with this in a
multi-worker system (where we can allow each worker to read and
process the chunk) then feel free to share your thoughts.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Re: Parallel copy at 2020-02-15 10:38:07 from Alastair Turner

Responses

Re: Parallel copy at 2020-02-15 17:51:05 from David Fetter
Re: Parallel copy at 2020-02-16 06:51:37 from Andrew Dunstan
Re: Parallel copy at 2020-02-17 15:04:35 from Ants Aasma

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andreas Karlsson	2020-02-15 14:58:22	Re: Use LN_S instead of "ln -s" in Makefile
Previous Message	Alastair Turner	2020-02-15 10:38:07	Re: Parallel copy