Re: Fast COPY FROM based on batch insert

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, tanghy(dot)fnst(at)fujitsu(dot)com, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, houzj(dot)fnst(at)fujitsu(dot)com
Subject: Re: Fast COPY FROM based on batch insert
Date: 2022-08-09 16:13:02
Message-ID: CALNJ-vQd-bMrydDzUg5+fr9w+gTnErCADhk3KSi72_RCcKJw=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
wrote:

> On Tue, Jul 26, 2022 at 7:19 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
> wrote:
> > Yeah, I think the patch is getting better, but I noticed some issues,
> > so I'm working on them. I think I can post a new version in the next
> > few days.
>
> * When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
> patch uses the slots passed to ExecForeignBatchInsert(), not the ones
> returned by the callback function, but I don't think that that is
> always correct, as the documentation about the callback function says:
>
> The return value is an array of slots containing the data that was
> actually inserted (this might differ from the data supplied, for
> example as a result of trigger actions.)
> The passed-in <literal>slots</literal> can be re-used for this
> purpose.
>
> postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
> I modified the patch to reference the returned slots when running the
> AFTER ROW triggers. I also modified the patch to initialize the
> tts_tableOid. Attached is an updated patch, in which I made some
> minor adjustments to CopyMultiInsertBufferFlush() as well.
>
> * The patch produces incorrect error context information:
>
> create extension postgres_fdw;
> create server loopback foreign data wrapper postgres_fdw options
> (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (f1 int, f2 text);
> create foreign table ft1 (f1 int, f2 text) server loopback options
> (table_name 't1');
> alter table t1 add constraint t1_f1positive check (f1 >= 0);
> alter foreign table ft1 add constraint ft1_f1positive check (f1 >= 0);
>
> — single insert
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR: new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL: Failing row contains (-1, foo).
> CONTEXT: remote SQL command: INSERT INTO public.t1(f1, f2) VALUES ($1, $2)
> COPY ft1, line 1: "-1 foo"
>
> — batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR: new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL: Failing row contains (-1, foo).
> CONTEXT: remote SQL command: INSERT INTO public.t1(f1, f2) VALUES
> ($1, $2), ($3, $4)
> COPY ft1, line 3
>
> In single-insert mode the error context information is correct, but in
> batch-insert mode it isn’t (i.e., the line number isn’t correct).
>
> The error occurs on the remote side, so I'm not sure if there is a
> simple fix. What I came up with is to just suppress error context
> information other than the relation name, like the attached. What do
> you think about that?
>
> (In CopyMultiInsertBufferFlush() your patch sets cstate->cur_lineno to
> buffer->linenos[i] even when running AFTER ROW triggers for the i-th
> row returned by ExecForeignBatchInsert(), but that wouldn’t always be
> correct, as the i-th returned row might not correspond to the i-th row
> originally stored in the buffer as the callback function returns only
> the rows that were actually inserted on the remote side. I think the
> proposed fix would address this issue as well.)
>
> * The patch produces incorrect row count in cases where some/all of
> the rows passed to ExecForeignBatchInsert() weren’t inserted on the
> remote side:
>
> create function trig_null() returns trigger as $$ begin return NULL;
> end $$ language plpgsql;
> create trigger trig_null before insert on t1 for each row execute
> function trig_null();
>
> — single insert
> alter server loopback options (drop batch_size);
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0 foo
> >> 1 bar
> >> \.
> COPY 0
>
> — batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0 foo
> >> 1 bar
> >> \.
> COPY 2
>
> The row count is correct in single-insert mode, but isn’t in batch-insert
> mode.
>
> The reason is that in batch-insert mode the row counter is updated
> immediately after adding the row to the buffer, not after doing
> ExecForeignBatchInsert(), which might ignore the row. To fix, I
> modified the patch to delay updating the row counter (and the progress
> of the COPY command) until after doing the callback function. For
> consistency, I also modified the patch to delay it even when batching
> into plain tables. IMO I think that that would be more consistent
> with the single-insert mode, as in that mode we update them after
> writing the tuple out to the table or sending it to the remote side.
>
> * I modified the patch so that when batching into foreign tables we
> skip useless steps in CopyMultiInsertBufferInit() and
> CopyMultiInsertBufferCleanup().
>
> That’s all I have for now. Sorry for the delay.
>
> Best regards,
> Etsuro Fujita
>

Hi,

+ /* If any rows were inserted, run AFTER ROW INSERT triggers. */
...
+ for (i = 0; i < inserted; i++)
+ {
+ TupleTableSlot *slot = rslots[i];
...
+ slot->tts_tableOid =
+ RelationGetRelid(resultRelInfo->ri_RelationDesc);

It seems the return value of
`RelationGetRelid(resultRelInfo->ri_RelationDesc)` can be stored in a
variable outside the for loop.
Inside the for loop, assign this variable to slot->tts_tableOid.

Cheers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2022-08-09 16:16:33 Re: SUBTRANS: Minimizing calls to SubTransSetParent()
Previous Message Magnus Hagander 2022-08-09 16:12:16 Re: moving basebackup code to its own directory