Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-06-13 04:12:20
Message-ID: CAA4eK1Jq6-=TYTj37XnWioaOkZ+nY0ipAuNyT=6AfU_0iPvqXA@mail.gmail.com

On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi All,
>
> I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with, I have a few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel, or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone, so no parallelism is involved there.
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch, but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions,
as explained in my earlier email [1]. See Case-2 (b) in that email.

> I mean, will there be some worker process dedicated to each partition, or how will it work?

No, the split is just based on the input; otherwise, each worker
inserts the same way we would have done without any workers.
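
As a purely illustrative sketch (ChunkQueue and chunk_queue_push are
hypothetical names, not from the patch), the leader could carve the
raw input into line-aligned chunks like this, with each worker then
running the ordinary per-tuple insert path on whole chunks:

#include <stddef.h>

#define MIN_CHUNK_SIZE (64 * 1024)      /* arbitrary granularity */

typedef struct ChunkQueue ChunkQueue;   /* hypothetical shared queue */
extern void chunk_queue_push(ChunkQueue *q, size_t start, size_t len);

static void
leader_split_input(const char *buf, size_t len, ChunkQueue *queue)
{
    size_t  chunk_start = 0;

    for (size_t i = 0; i < len; i++)
    {
        /*
         * Split only at newlines so no tuple straddles two chunks.
         * Real CSV input would also need to track quoting rules
         * here; that is omitted for brevity.
         */
        if (buf[i] == '\n' && i + 1 - chunk_start >= MIN_CHUNK_SIZE)
        {
            chunk_queue_push(queue, chunk_start, i + 1 - chunk_start);
            chunk_start = i + 1;
        }
    }

    if (chunk_start < len)              /* trailing partial chunk */
        chunk_queue_push(queue, chunk_start, len - chunk_start);
}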

> Q3) In case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GBs), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few, so can we decide the number of worker processes to be launched based on the file size?
>

Yeah, such situations would be tricky, so we should have an option
for the user to specify the number of workers.
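
For example, a hypothetical heuristic (choose_copy_workers and
MAX_COPY_WORKERS are made-up names, not from the patch) could honor
an explicit user request first and fall back to a size-based guess
only otherwise:

#include <stdint.h>

#define MAX_COPY_WORKERS 8      /* arbitrary cap for the sketch */

static int
choose_copy_workers(int requested, uint64_t input_size)
{
    /* An explicit request wins, e.g. for a file with one huge tuple */
    if (requested > 0)
        return requested;

    /* Heuristic: roughly one worker per 64MB of input, capped */
    int nworkers = (int) (input_size / (64 * 1024 * 1024));

    if (nworkers < 1)
        nworkers = 1;
    if (nworkers > MAX_COPY_WORKERS)
        nworkers = MAX_COPY_WORKERS;

    return nworkers;
}

A size-only guess would launch many workers for one giant toasted
tuple, which is exactly why the explicit option matters.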

> Q4) Who is going to process constraints (particularly deferred constraints) that are supposed to be executed at COMMIT time? I mean, is it the leader process or the worker process, or will we not choose parallelism at all in such cases?
>

In the first version, we won't do parallelism for this. Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloat when the data is loaded in parallel? I am just asking this because, in the case of parallelism, there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is free space in the file being written to, because that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.
>

Hmm, each worker will operate at the page level; after its first
insertion, a worker will try to insert into the same page into which
it inserted last, so there shouldn't be such a problem.
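
For reference, here is a minimal sketch of that pattern using the
existing bulk-insert machinery (this is the stock heapam API, not
the parallel-copy patch itself): each worker keeps its own
BulkInsertState, and heap_insert() keeps reusing the buffer cached
there until the page is full, so workers don't compete for pages:

#include "postgres.h"
#include "access/heapam.h"

static void
worker_insert_chunk(Relation rel, HeapTuple *tuples, int ntuples,
                    CommandId cid)
{
    /* Per-worker state; caches the worker's current target buffer */
    BulkInsertState bistate = GetBulkInsertState();

    for (int i = 0; i < ntuples; i++)
    {
        /*
         * heap_insert() first tries the page pinned in
         * bistate->current_buf and extends the relation only when
         * that page has no room, so each worker fills "its" page
         * before moving on.
         */
        heap_insert(rel, tuples[i], cid, 0, bistate);
    }

    ReleaseBulkInsertState(bistate);
}

Since a worker extends into a fresh page only when its cached page
is full, parallel workers shouldn't leave behind the half-empty
pages the question worries about.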

> Please note that I haven't gone through all the emails in this thread, so there is a possibility that I have repeated a question that has already been raised and answered here. If that is the case, I am sorry, but it would be very helpful if someone could point me to that thread so that I can go through it. Thank you.
>

No problem, I understand it is sometimes difficult to go through
each and every email, especially when the discussion is long.
Anyway, thanks for showing interest in the patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
