Re: Parallel copy

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-11-03 12:35:32
Message-ID: 60fb6859-60f4-4a5b-1866-73c3cd2f0d16@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
>
> Okay, but this first-phase (finding the line boundaries) can anyway
> be not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process
> or multiple processes.
Right, it won't matter performance-wise. That's not my point. The
difference is in the complexity. If you don't store the line boundaries
in shared memory, you get away with much simpler shared memory structures.

> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review.

Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nice to review and work with.

2. Refactor CopyReadLine(), so that finding the line-endings and the
rest of the line-parsing are separated into separate functions.

3. Implement parallel copy.

> I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the
> lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
> APIs, etc.) have checks which doesn't allow any writes, we need to see
> which of those we can open now (or do some additional work to prevent
> from those checks) after some of the work done for parallel-writes in
> PG-13[1][2], and (b) in which all cases we can parallel-writes
> (parallel copy) is allowed, for example need to identify whether table
> or one of its partitions has any constraint/expression which is
> parallel-unsafe.

Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1]
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2020-11-03 12:36:55 Re: Re: parallel distinct union and aggregate support patch
Previous Message Heikki Linnakangas 2020-11-03 12:05:58 Re: ModifyTable overheads in generic plans