Re: Parallel copy

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-08 19:30:04
Message-ID: CA+TgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants(at)cybertec(dot)at> wrote:
> I think the element-based approach and the requirement that all tuples
> fit into the queue make things unnecessarily complex. The approach I
> detailed earlier allows for tuples to be bigger than the buffer. In
> that case a worker will claim the long tuple from the ring queue of
> tuple start positions and start copying it into its local line_buf.
> This can wrap around the buffer multiple times until the next start
> position shows up. At that point this worker can proceed with
> inserting the tuple and the next worker will claim the next tuple.
>
> This way nothing needs to be resized, there is no risk of a file with
> huge tuples running the system out of memory because each element
> would have to be reallocated to be huge, and the number of elements is
> not something that has to be tuned.

+1. This seems like the right way to do it.
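
To make that concrete, here's a minimal sketch of the consumer side as
I understand it. Everything in it is invented for illustration
(SharedState, claim_next_start, wait_for_next_start, RING_SIZE - none
of these names are from any posted patch), and the flow control against
the process filling the ring is omitted:

#include "postgres.h"
#include "lib/stringinfo.h"

#define RING_SIZE (64 * 1024)       /* illustrative size only */

typedef struct SharedState
{
    char        ring[RING_SIZE];    /* raw COPY data */
    /* plus a ring of tuple start offsets, flow control, etc. */
} SharedState;

/* hypothetical helpers standing in for the real synchronization */
extern uint64 claim_next_start(SharedState *shm);
extern uint64 wait_for_next_start(SharedState *shm);

/*
 * Copy one input line out of the shared ring into backend-local
 * memory.  Because we copy piecewise, the line may be arbitrarily
 * longer than RING_SIZE: it just wraps around the ring as many times
 * as needed while the reading process refills it ahead of us.
 */
static void
read_one_line(SharedState *shm, StringInfo line_buf)
{
    uint64      start = claim_next_start(shm);
    uint64      end = wait_for_next_start(shm);

    resetStringInfo(line_buf);
    while (start < end)
    {
        uint64      off = start % RING_SIZE;
        uint64      chunk = Min(end - start, (uint64) RING_SIZE - off);

        appendBinaryStringInfo(line_buf, shm->ring + off, (int) chunk);
        start += chunk;
    }
    /* line_buf now holds the complete line; parse and insert it */
}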

> > We had a couple of options for the way in which queue elements can
> > be stored. Option 1: each element (DSA chunk) will contain tuples
> > such that each tuple is preceded by its length, so the tuples will
> > be arranged like (length of tuple-1, tuple-1), (length of tuple-2,
> > tuple-2), .... Option 2: each element (DSA chunk) will contain only
> > tuples, (tuple-1), (tuple-2), ..., and we will have a second
> > ring-buffer which contains the start offset or length of each
> > tuple. The old design used to generate and process the data one
> > tuple at a time. In the new design, the server will generate
> > multiple tuples of data per queue element, and the worker will then
> > process that data tuple by tuple. Since we are processing the data
> > tuple by tuple either way, I felt both options are almost the same.
> > However, Option 1 was chosen over Option 2 because it saves the
> > space that another variable would otherwise require in each element
> > of the queue.
>
> With option 1 it's not possible to read input data directly into
> shared memory, and there needs to be an extra memcpy in the
> time-critical sequential flow of the leader. With option 2 data could
> be read directly into the shared memory buffer. With future async I/O
> support, reading and looking for tuple boundaries could be performed
> concurrently.

But option 2 still seems significantly worse than your proposal above, right?
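
For anyone following along, here's roughly what the two quoted layouts
look like; this is only an illustration of the description above, not
code from the patch:

/*
 * Option 1: one ring, lengths interleaved with tuple data:
 *
 *     [len1][tuple-1 bytes][len2][tuple-2 bytes] ...
 *
 * The leader must write each length in front of each tuple, so the
 * input cannot be read() straight into the ring: every byte takes an
 * extra memcpy in the leader's sequential path.
 *
 * Option 2: two rings, raw data separate from per-tuple metadata:
 *
 *     data ring: [tuple-1 bytes][tuple-2 bytes] ...   <- read() target
 *     meta ring: [start-1][start-2] ...               <- tuple starts
 *
 * read() (or future async I/O) can fill the data ring directly; the
 * leader appends a start offset as it finds each boundary.
 */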

I really think we don't want a single worker in charge of finding
tuple boundaries for everybody. That adds a lot of unnecessary
inter-process communication and synchronization. Each process should
just get the next tuple starting after where the last one ended, and
then advance the end pointer so that the next process can do the same
thing. Vignesh's proposal involves having a leader process that has to
switch roles - he picks an arbitrary 25% threshold - and if it doesn't
switch roles at the right time, performance will be impacted. If the
leader doesn't get scheduled in time to refill the queue before it
runs completely empty, workers will have to wait. Ants's scheme avoids
that risk: whoever needs the next tuple reads the next line. There's
no need to ever wait for the leader because there is no leader.
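
The hand-off could be as simple as a compare-and-swap on a shared
position. This is only a sketch (ParseState and find_line_end are
invented names, and redoing the boundary search on a CAS retry wastes
a little work that a real patch would want to avoid):

#include "postgres.h"
#include "port/atomics.h"

/* hypothetical shared state; not from any posted patch */
typedef struct ParseState
{
    pg_atomic_uint64 next_start;    /* first unclaimed input byte */
} ParseState;

/* stand-in for the real boundary search (quoting, escapes, etc.) */
extern uint64 find_line_end(uint64 start);

/*
 * Claim the next line: whoever gets here first takes the line that
 * starts at next_start and advances the shared pointer past it.
 * There is no leader and nobody to wait for.
 */
static void
claim_next_line(ParseState *ps, uint64 *start, uint64 *end)
{
    uint64      s = pg_atomic_read_u64(&ps->next_start);
    uint64      e;

    do
    {
        e = find_line_end(s);
        /* on CAS failure, s is updated to the current value; retry */
    } while (!pg_atomic_compare_exchange_u64(&ps->next_start, &s, e));

    *start = s;
    *end = e;
}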

I think it's worth enumerating some of the other ways that a project
in this area can fail to achieve good speedups, so that we can try to
avoid those that are avoidable and be aware of the others:

- If we're unable to supply data to the COPY process as fast as the
workers could load it, then speed will be limited at that point. We
know reading the file from disk is pretty fast compared to what a
single process can do. I'm not sure we've tested what happens with a
network socket. It will depend somewhat on the network speed, but it
might be useful to know how many MB/s we can pump through over a UNIX
socket (a quick way to measure that is sketched after this list).

- The portion of the time that is used to split the lines is not
easily parallelizable. That seems to be a fairly small percentage for
a reasonably wide table, but it looks significant (13-18%) for a
narrow table. Such cases will gain less performance and be limited to
a smaller number of workers. I think we also need to be careful about
files whose lines are longer than the size of the buffer. If we're not
careful, we could get a significant performance drop-off in such
cases. We should make sure to pick an algorithm that seems likely to
handle such cases without serious regressions and check that a file
composed entirely of such long lines is handled reasonably efficiently
(a throwaway generator for that kind of input also appears after this
list).

- There could be index contention. Let's suppose that we can read data
super fast and break it up into lines super fast. Maybe the file we're
reading is fully RAM-cached and the lines are long. Now all of the
backends are inserting into the indexes at the same time, and they
might be trying to insert into the same pages. If so, lock contention
could become a factor that hinders performance.

- There could also be similar contention on the heap. Say the tuples
are narrow, and many backends are trying to insert tuples into the
same heap page at the same time. This would lead to many lock/unlock
cycles. This could be avoided if the backends refrained from targeting
the same heap pages, but I'm not sure there's any reason to expect that
they would do so unless we make some special provision for it.

- These problems could also arise with respect to TOAST table
insertions, either on the TOAST table itself or on its index. This
would only happen if the table contains a lot of toastable values, but
that could be the case: imagine a table with a bunch of columns each
of which contains a long string that isn't very compressible.

- What else? I bet the above list is not comprehensive.
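
On the first point above, raw UNIX-socket bandwidth is easy to
ballpark with a throwaway program like this one (standalone C, nothing
PostgreSQL-specific; the numbers will vary a lot with the kernel, CPU,
and socket buffer sizes):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK (256 * 1024)
#define TOTAL (4LL * 1024 * 1024 * 1024)    /* 4 GB */

int
main(void)
{
    int         sv[2];
    static char buf[CHUNK];
    struct timeval t0, t1;
    long long   received = 0;
    ssize_t     n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
    {
        perror("socketpair");
        return 1;
    }

    if (fork() == 0)
    {
        /* child: pump TOTAL bytes of junk through the socket */
        long long   sent = 0;

        close(sv[0]);
        while (sent < TOTAL)
        {
            n = write(sv[1], buf, CHUNK);
            if (n < 0)
                _exit(1);
            sent += n;
        }
        close(sv[1]);
        _exit(0);
    }

    /* parent: drain the socket and time it */
    close(sv[1]);
    gettimeofday(&t0, NULL);
    while ((n = read(sv[0], buf, CHUNK)) > 0)
        received += n;
    gettimeofday(&t1, NULL);
    wait(NULL);

    {
        double      secs = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_usec - t0.tv_usec) / 1e6;

        printf("%.0f MB/s\n", received / secs / (1024 * 1024));
    }
    return 0;
}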
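
And on the long-lines point, a pathological input file is easy to
manufacture; e.g. this throwaway generator emits a two-column table
whose second column is a ~1MB string (the sizes are arbitrary):

#include <stdio.h>

int
main(void)
{
    const int   line_len = 1024 * 1024;     /* ~1 MB per line */
    const int   nlines = 1024;              /* ~1 GB of input */

    for (int i = 0; i < nlines; i++)
    {
        printf("%d\t", i);
        for (int j = 0; j < line_len; j++)
            putchar('x');
        putchar('\n');
    }
    return 0;
}

Redirect the output to a file, COPY it into an (int, text) table, and
compare against a file of short lines of the same total size.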

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
