Re: Parallel copy

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-06-15 14:11:03
Message-ID: CAE9k0Pnf+fdcMa1Geq-OQTeJ3sHrWfDJ3cG4nGoPcbahmqNk8A@mail.gmail.com
Lists: pgsql-hackers

Thanks, Amit, for the clarifications. Regarding partitioned tables, one of my
questions was: if we are loading data into a partitioned table using the COPY
command, the input file would contain tuples destined for different tables
(partitions), unlike the normal table case where all the tuples in the input
file belong to the same table. So, in such a case, how are we going to
accumulate the tuples into the DSM? I mean, will the leader process check
which tuple needs to be routed to which partition and accumulate them into
the DSM accordingly? For example, let's say the input data file has 10 tuples
where the 1st tuple belongs to partition1, the 2nd belongs to partition2, and
so on. In such cases, will the leader process accumulate all the tuples
belonging to partition1 into one DSM and the tuples belonging to partition2
into some other DSM and assign them to the worker processes, or has some
other approach been taken to handle this scenario?
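
Just to make the question concrete, this is roughly the leader-side loop I
was imagining. Every name in the sketch below is made up by me purely for
illustration and does not exist in the patch:

    /*
     * Purely illustrative pseudocode for the question above; all of these
     * helper names are hypothetical.
     */
    char   *line;

    while ((line = read_next_raw_line(cstate)) != NULL)
    {
        /* Does the leader parse the line just far enough to find the
         * target partition ... */
        Oid     partoid = route_line_to_partition(parent_rel, line);

        /* ... and then append it to a per-partition chunk in the DSM? */
        append_to_partition_chunk(partition_chunk_for(partoid), line);
    }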

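Also, regarding the answer to Q5 below (each worker operating at the page
level): just to check my understanding, I assume each worker would keep its
own BulkInsertState, roughly as in the sketch below. The pcv_next_tuple()
helper and the loop structure are entirely made up by me; only the
BulkInsertState and table_tuple_insert calls are existing APIs.

    #include "postgres.h"
    #include "access/heapam.h"      /* GetBulkInsertState, FreeBulkInsertState */
    #include "access/tableam.h"     /* table_tuple_insert */
    #include "access/xact.h"        /* GetCurrentCommandId */
    #include "executor/tuptable.h"  /* TupleTableSlot */
    #include "utils/rel.h"          /* Relation */

    /* Hypothetical helper that pulls the next tuple out of the shared
     * memory area filled by the leader. */
    extern TupleTableSlot *pcv_next_tuple(void);

    /*
     * Rough sketch of a per-worker insert loop.
     */
    static void
    worker_insert_loop(Relation rel)
    {
        BulkInsertState bistate = GetBulkInsertState();
        TupleTableSlot *slot;

        while ((slot = pcv_next_tuple()) != NULL)
        {
            /*
             * bistate keeps the worker's current buffer pinned, so
             * consecutive inserts keep filling the same page instead of
             * extending the relation each time.
             */
            table_tuple_insert(rel, slot, GetCurrentCommandId(true),
                               0, bistate);
        }
        FreeBulkInsertState(bistate);
    }

If that is the idea, then I agree each worker would mostly keep to its own
page and the bloat risk should be small.
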
Further, I haven't had much time to look into the links that you shared in
your previous response. I will have a look at those and will also slowly
start looking into the patches as and when I get some time. Thank you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
> wrote:
> >
> > Hi All,
> >
> > I've spent a little bit of time going through the project discussion that
> has happened in this email thread and, to start with, I have a few questions
> which I would like to put here:
> >
> > Q1) Are we also planning to read the input data in parallel or is it
> only about performing the multi-insert operation in parallel? AFAIU, the
> data reading part will be done by the leader process alone, so no
> parallelism is involved there.
> >
>
> Yes, your understanding is correct.
>
> > Q2) How are we going to deal with the partitioned tables?
> >
>
> I haven't studied the patch but my understanding is that we will
> support parallel copy for partitioned tables with a few restrictions
> as explained in my earlier email [1]. See, Case-2 (b) in the email.
>
> > I mean, will there be some worker process dedicated to each partition, or
> how will it work?
>
> No, the split is just based on the input; otherwise, each worker
> should insert just as we would have done without any workers.
>
> > Q3) In case of toast tables, there is a possibility of having a single
> tuple in the input file which could be of a very big size (probably in GBs),
> eventually resulting in a bigger file size. So, in this case, how are we
> going to decide the number of worker processes to be launched? I mean,
> although the file size is big, the number of tuples to be processed is
> just one or a few, so can we decide the number of worker
> processes to be launched based on the file size?
> >
>
> Yeah, such situations would be tricky, so we should have an option for
> the user to specify the number of workers.
>
> > Q4) Who is going to process constraints (particularly deferred
> constraints) that are supposed to be executed at COMMIT time? I mean, is
> it the leader process or the worker process, or will we not choose
> parallelism at all in such cases?
> >
>
> In the first version, we won't do parallelism for this. Again, see
> one of my earlier emails [1] where I have explained this and other
> cases where we won't be supporting parallel copy.
>
> > Q5) Do we have any risk of table bloat when the data is loaded in
> parallel? I am asking this because, in case of parallelism, there would
> be multiple processes performing bulk inserts into a table. There is a
> chance that the table file might get extended even if there is free space
> in the file being written to, because that space is locked by some other
> worker process, and hence that might result in the creation of a new block
> for that table. Sorry if I am missing something here.
> >
>
> Hmm, each worker will operate at the page level; after the first insertion,
> the same worker will try to insert into the same page it last inserted
> into, so there shouldn't be such a problem.
>
> > Please note that I haven't gone through all the emails in this thread, so
> there is a possibility that I might have repeated a question that has
> already been raised and answered here. If that is the case, I am sorry for
> that, but it would be very helpful if someone could point me to that
> thread so that I can go through it. Thank you.
> >
>
> No problem; I understand it is sometimes difficult to go through each
> and every email, especially when the discussion is long. Anyway,
> thanks for showing interest in the patch.
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
