Re: Parallel copy

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, vignesh C <vignesh21(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-04-10 11:40:06
Message-ID: CA+TgmoYzSDmgfzbhfvrotXWtFXfk1jK_nRnM11ZzEYxaLP=+uQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I've not yet read the whole thread. So I'm probably restating ideas.

Yeah, but that's OK.

> IMO, yes, there should be only one process doing the chunking: for ILP and cache efficiency, but also because the leader is the only process with access to the network socket. It should load input data into one large buffer that's shared across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunks should be reduced in size.

My concern here is that it's going to be hard to avoid processes going
idle. If the leader does nothing at all once the ring buffer is full,
it's wasting time that it could spend processing a chunk. But if it
picks up a chunk, then it might not get around to refilling the buffer
before other processes are idle with no work to do.
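
For readers following the thread, here is a rough, single-process sketch of the offset ring buffer described in the quoted text. It is only an illustration, not code from any patch: the names `find_line_spans`, `OffsetRing`, `push`, and `claim` are hypothetical, the halving rule for a non-full ring is an assumed policy, and line splitting ignores CSV quoting and escapes that a real COPY implementation must handle.

```python
from collections import deque


def find_line_spans(buf: bytes):
    """Leader-side chunking: return (start, end) offsets of each line in buf.

    Simplification (assumption): a newline always ends a line, so quoted
    newlines in CSV input are not handled here.
    """
    spans = []
    start = 0
    while True:
        nl = buf.find(b"\n", start)
        if nl == -1:
            break
        spans.append((start, nl))
        start = nl + 1
    if start < len(buf):
        # Trailing data without a final newline still forms a line.
        spans.append((start, len(buf)))
    return spans


class OffsetRing:
    """Bounded queue of line offsets pointing into a shared input buffer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = deque()

    def is_full(self) -> bool:
        return len(self.slots) >= self.capacity

    def push(self, span):
        # The leader must stop (or do something else) once the ring is full.
        if self.is_full():
            raise BufferError("offset ring is full")
        self.slots.append(span)

    def claim(self, max_batch: int):
        """Worker side: take a batch of offsets.

        Assumed policy: take max_batch entries when the ring is full, and
        half as many (at least one) otherwise, so that a ring that is
        running low is shared out in smaller pieces.
        """
        n = max_batch if self.is_full() else max(1, max_batch // 2)
        n = min(n, len(self.slots))
        return [self.slots.popleft() for _ in range(n)]
```

In this sketch, the idle-time concern shows up at `push`: once `is_full()` returns true the leader has nothing to do until some worker calls `claim`, and if it picks up a batch itself, nobody refills the ring in the meantime.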

Still, it might be the case that having the process that is reading
the data also find the line endings is so fast that it makes no sense
to split those two tasks. After all, whoever just read the data must
have it in cache, and that helps a lot.
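
As a small illustration of doing both jobs in one place, the sketch below reads fixed-size blocks and finds line endings in the same loop, carrying any partial line over to the next read. The function name `read_and_split` and the `read_size` parameter are hypothetical, and the splitting again assumes that every newline ends a line (no CSV quoting).

```python
import io


def read_and_split(stream, read_size=8192):
    """Yield complete lines as each block arrives from stream.

    The same loop that reads a block also scans it for line endings,
    so the data is scanned right after it was read.
    """
    pending = b""  # partial line left over from the previous block
    while True:
        block = stream.read(read_size)
        if not block:
            break
        pending += block
        # Everything before the last newline is complete; the rest waits.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line
    if pending:
        # Final line without a trailing newline.
        yield pending


if __name__ == "__main__":
    data = io.BytesIO(b"a,1\nbb,22\nccc,333")
    for line in read_and_split(data, read_size=4):
        print(line)
```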

> Given that everything stalls if the leader doesn't accept further input data, as well as when there are no split chunks available, it doesn't seem like a good idea to have the leader do other work.
>
> I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN is the only thing that practically matters.

Yeah, I think Amit has been thinking primarily in terms of COPY from
files, and I've been encouraging him to at least consider the STDIN
case. But I think you're right, and COPY FROM STDIN should be the
design center for this feature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
