Re: Parallel copy

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-09 11:49:06
Message-ID: CAFiTN-tzi3hUMrC_4iGZbv-g+j59UOpvFCWEGq3C4+jYx1=RpA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants(at)cybertec(dot)at> wrote:
> > I think the element-based approach and the requirement that all tuples
> > fit into the queue make things unnecessarily complex. The approach I
> > detailed earlier allows for tuples to be bigger than the buffer. In
> > that case a worker will claim the long tuple from the ring queue of
> > tuple start positions, and start copying it into its local line_buf.
> > This can wrap around the buffer multiple times until the next start
> > position shows up. At that point this worker can proceed with
> > inserting the tuple and the next worker will claim the next tuple.
> >
> > This way nothing needs to be resized, there is no risk of a file with
> > huge tuples running the system out of memory (because each element
> > would have to be reallocated to be huge), and the number of elements
> > is not something that has to be tuned.
>
> +1. This seems like the right way to do it.
>
> > > We had a couple of options for the way in which queue elements can be stored.
> > > Option 1: Each element (DSA chunk) will contain tuples such that each
> > > tuple will be preceded by the length of the tuple. So the tuples will
> > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only
> > > tuples (tuple-1), (tuple-2), ..... And we will have a second
> > > ring-buffer which contains a start-offset or length of each tuple. The
> > > old design used to generate one tuple of data and process tuple by
> > > tuple. In the new design, the server will generate multiple tuples of
> > > data per queue element. The worker will then process data tuple by
> > > tuple. As we are processing the data tuple by tuple, I felt both of
> > > the options are almost the same. However, Option 1 was chosen over
> > > Option 2 as it saves some space that was required for another
> > > variable in each element of the queue.
> >
> > With option 1 it's not possible to read input data into shared memory,
> > and there needs to be an extra memcpy in the time-critical sequential
> > flow of the leader. With option 2, data could be read directly into the
> > shared memory buffer. With future async I/O support, reading and
> > looking for tuple boundaries could be performed concurrently.
>
> But option 2 still seems significantly worse than your proposal above, right?
>
> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. Vignesh's proposal involves having a leader process that has to
> switch roles - he picks an arbitrary 25% threshold - and if it doesn't
> switch roles at the right time, performance will be impacted. If the
> leader doesn't get scheduled in time to refill the queue before it
> runs completely empty, workers will have to wait. Ants's scheme avoids
> that risk: whoever needs the next tuple reads the next line. There's
> no need to ever wait for the leader because there is no leader.
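
Just so that I am sure we are talking about the same scheme, here is a
toy standalone model of the claim-and-advance pattern (plain C with
threads; none of these names come from any patch): each worker takes
the claim lock, finds the end of the next line, publishes the new claim
position for the next worker, and only then copies and "inserts" its
line outside the lock.

/*
 * Toy model of the claim-and-advance scheme.  The whole input is
 * already in memory here; in the real thing it would be a shared ring
 * buffer that the reader keeps refilling.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NWORKERS 3

static const char *input = "1\talice\n2\tbob\n3\tcarol\n4\tdave\n";
static size_t claim_pos;        /* next unclaimed byte of the input */
static pthread_mutex_t claim_lock = PTHREAD_MUTEX_INITIALIZER;

/* Claim the next line: set [*start, *end) and advance claim_pos. */
static int
claim_next_line(size_t *start, size_t *end)
{
    int found = 0;

    pthread_mutex_lock(&claim_lock);
    if (input[claim_pos] != '\0')
    {
        const char *nl = strchr(input + claim_pos, '\n');

        *start = claim_pos;
        *end = nl ? (size_t) (nl - input) + 1 : strlen(input);
        claim_pos = *end;       /* the next worker starts here */
        found = 1;
    }
    pthread_mutex_unlock(&claim_lock);
    return found;
}

static void *
worker(void *arg)
{
    size_t start, end;

    /* parsing and insertion happen outside the lock */
    while (claim_next_line(&start, &end))
        printf("worker %ld inserts: %.*s", (long) (intptr_t) arg,
               (int) (end - start), input + start);
    return NULL;
}

int
main(void)
{
    pthread_t tid[NWORKERS];

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) (intptr_t) i);
    for (long i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

The ring-buffer wrap-around and tuples larger than the buffer are left
out; this only models which process finds the tuple boundaries.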

I agree that if the leader switches roles, it may sometimes fail to
produce more work before the queue runs empty. OTOH, the problem with
the approach you are suggesting is that the work is generated on
demand, i.e. there is no dedicated process producing data while the
workers are busy inserting it. So IMHO, with a dedicated leader
process there will always be work available for all the workers. I
agree that we need to find the right point at which the leader should
switch to working as a worker. One idea could be that when the queue
is full and there is no space to push more work onto it, the leader
processes that work itself, as in the sketch below.
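
To make that concrete, the shape of the leader loop I have in mind is
roughly the following (produce_next_chunk(), queue_try_push() and
insert_chunk() are hypothetical names, only there to show the control
flow, not anything from the current patch):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers, only to show the shape of the loop: */
typedef struct Chunk Chunk;
typedef struct Queue Queue;
extern Chunk *produce_next_chunk(void); /* read and split input into tuples */
extern bool queue_try_push(Queue *queue, Chunk *chunk); /* false if full */
extern void insert_chunk(Chunk *chunk); /* insert the tuples directly */

static void
leader_loop(Queue *queue)
{
    for (;;)
    {
        Chunk  *chunk = produce_next_chunk();

        if (chunk == NULL)
            break;              /* end of input */

        if (!queue_try_push(queue, chunk))
        {
            /*
             * The queue is full, so the workers already have plenty of
             * pending work.  Instead of waiting for free space, the
             * leader acts as a worker for this one chunk and then goes
             * back to producing.
             */
            insert_chunk(chunk);
        }
    }
}

That way the leader never blocks on a full queue, and it only spends
time inserting when the workers are already saturated.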

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
