Re: Parallel copy

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-08 22:24:59
Message-ID: CANwKhkN8jEeKREkM+g0RqPHwT=AkH+Qb3LpEAkb=wPKHMZfS8A@mail.gmail.com
Lists: pgsql-hackers

On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> - If we're unable to supply data to the COPY process as fast as the
> workers could load it, then speed will be limited at that point. We
> know reading the file from disk is pretty fast compared to what a
> single process can do. I'm not sure we've tested what happens with a
> network socket. It will depend on the network speed some, but it might
> be useful to know how many MB/s we can pump through over a UNIX
> socket.

This raises a good point. If at some point we want to minimize the
number of memory copies, we might want to allow RDMA to write incoming
network traffic directly into a distributing ring buffer, protocol-level
headers included. But at this point we are so far from network
reception becoming a bottleneck that I don't think it's worth holding
anything up just to keep the door open for zero-copy transfers.
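
For what it's worth, a throwaway test program along the following lines
(purely illustrative, not part of any patch; error handling omitted)
should give a ballpark MB/s figure for the UNIX socket question above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <time.h>

    #define CHUNK   (256 * 1024)
    #define TOTAL   (4096LL * 1024 * 1024)      /* 4 GB */

    int
    main(void)
    {
        int         sv[2];
        char       *buf = malloc(CHUNK);
        struct timespec t0, t1;
        long long   received = 0;

        memset(buf, 'x', CHUNK);
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        if (fork() == 0)
        {
            /* child: pump TOTAL bytes into its end of the socket */
            for (long long sent = 0; sent < TOTAL; )
                sent += write(sv[1], buf, CHUNK);
            _exit(0);
        }

        /* parent: time how long it takes to read everything back out */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (received < TOTAL)
            received += read(sv[0], buf, CHUNK);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double      secs = (t1.tv_sec - t0.tv_sec) +
                           (t1.tv_nsec - t0.tv_nsec) / 1e9;

        printf("%.0f MB/s\n", received / secs / (1024 * 1024));
        wait(NULL);
        return 0;
    }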

> - The portion of the time that is used to split the lines is not
> easily parallelizable. That seems to be a fairly small percentage for
> a reasonably wide table, but it looks significant (13-18%) for a
> narrow table. Such cases will gain less performance and be limited to
> a smaller number of workers. I think we also need to be careful about
> files whose lines are longer than the size of the buffer. If we're not
> careful, we could get a significant performance drop-off in such
> cases. We should make sure to pick an algorithm that seems like it
> will handle such cases without serious regressions and check that a
> file composed entirely of such long lines is handled reasonably
> efficiently.

I don't have a proof, but my gut feeling is that it's fundamentally
impossible to ingest CSV without a serial line-ending/comment
tokenization pass. The current line-splitting algorithm is terrible.
I'm currently working with some scientific data where CopyReadLineText()
accounts for about 25% of the ingestion profile. I prototyped a
replacement that can do ~8 GB/s on narrow rows, more on wider ones.
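
To illustrate what I mean by a serial tokenization pass, here is a
simplified sketch (not the prototype itself; it only tracks quote state
and ignores escapes and alternative line endings). The point it shows is
why the pass has to be serial: whether a newline ends a record depends
on all the quotes that came before it.

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Scan buf[0..len) and record the offset just past each record-ending
     * newline in ends[].  Quote state is carried across buffers in
     * *in_quote.  Returns the number of boundaries found.
     */
    static size_t
    scan_record_ends(const char *buf, size_t len, bool *in_quote,
                     size_t *ends, size_t max_ends)
    {
        size_t      n = 0;

        for (size_t i = 0; i < len && n < max_ends; i++)
        {
            if (buf[i] == '"')
                *in_quote = !*in_quote;
            else if (buf[i] == '\n' && !*in_quote)
                ends[n++] = i + 1;
        }
        return n;
    }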

For rows that are consistently wider than the input buffer, I think
parallelism will still give a win: the serial phase is just a memcpy
through a ring buffer, after which a worker goes off to perform the
actual insert, letting the next worker read the data. The memcpy is
already happening today, since CopyReadLineText() copies the input
buffer into a StringInfo, so the only extra work is synchronization
between leader and worker.
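
Roughly, the leader-side handoff I have in mind looks like this (names
and layout are made up for illustration; the real thing would live in
shared memory with proper locking or atomics):

    #include <stddef.h>

    #define RING_SIZE   (64 * 1024)

    typedef struct LineRingBuffer
    {
        char        data[RING_SIZE];
        size_t      write_pos;      /* advanced by the leader */
        size_t      read_pos;       /* advanced by the consuming worker */
    } LineRingBuffer;

    /*
     * Copy as much of src as currently fits; the caller loops until the
     * whole line has been handed off, waking a worker once it is complete.
     */
    static size_t
    ring_copy_in(LineRingBuffer *rb, const char *src, size_t len)
    {
        size_t      avail = RING_SIZE - (rb->write_pos - rb->read_pos);
        size_t      n = len < avail ? len : avail;

        for (size_t i = 0; i < n; i++)
            rb->data[(rb->write_pos + i) % RING_SIZE] = src[i];

        rb->write_pos += n;
        return n;
    }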

> - There could be index contention. Let's suppose that we can read data
> super fast and break it up into lines super fast. Maybe the file we're
> reading is fully RAM-cached and the lines are long. Now all of the
> backends are inserting into the indexes at the same time, and they
> might be trying to insert into the same pages. If so, lock contention
> could become a factor that hinders performance.

Different data distribution strategies can have an effect on that.
Dealing out input data in larger or smaller chunks will have a
considerable effect on contention, btree page splits, and all kinds of
other things. I think the common theme would be a push to increase
chunk size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples
> are narrow, and many backends are trying to insert tuples into the
> same heap page at the same time. This would lead to many lock/unlock
> cycles. This could be avoided if the backends avoid targeting the same
> heap pages, but I'm not sure there's any reason to expect that they
> would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I misremembering?

> - What else? I bet the above list is not comprehensive.

I think the parallel copy patch needs to concentrate on splitting input
data between workers. After that, any performance issues would be
basically the same as for a normal parallel insert workload. There may
well be bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec
