Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, a(dot)lepikhov(at)postgrespro(dot)ru
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2022-08-29 13:02:38
Message-ID: 686e46d6718f9535bc39fac7cad7408b@oss.nttdata.com
Lists: pgsql-hackers

On 2022-08-25 01:54, Damir Belyalov wrote:
>>> + /* Buffer was filled, commit subtransaction and prepare to replay */
>>> + ReleaseCurrentSubTransaction();
>> What is actually being committed by this ReleaseCurrentSubTransaction()?
>> It seems to me that just safecstate->replay_buffer is filled before
>> this commit.
>
> All tuples are collected in replay_buffer, which is created in
> 'oldcontext' of CopyFrom() (not in the context of the subtransaction).
> That's why it wasn't cleaned up when you used
> RollbackAndReleaseCurrentSubTransaction().
> Subtransactions are needed for better safety. There is no error when
> copying from a file to the replay_buffer, but an error may occur at
> the next stage - when adding a tuple to the table. Also there may be
> other errors that are not obvious at first glance.

Thanks for the explanation and the updated patch.
I now understand that the data being COPYed are not the target of
COMMIT.

In past discussions[1] it seems the data to be COPYed were also
subject to COMMIT, but I understand this patch has adopted a different
design.

350 + /* Buffer was filled, commit subtransaction and prepare to replay */
351 + ReleaseCurrentSubTransaction();
352 + CurrentResourceOwner = safecstate->oldowner;
353 +
354 + safecstate->replay_is_active = true;
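
For my own understanding, here is a rough sketch of how I read this
flow. This is not the patch's actual code: SafeCopyFromState and its
fields follow the patch naming, while the helper name and the includes
are only illustrative.

```c
#include "postgres.h"

#include "access/xact.h"		/* subtransaction calls */
#include "utils/resowner.h"		/* CurrentResourceOwner */

/* hypothetical helper showing the buffer-then-replay pattern */
static void
fill_replay_buffer(SafeCopyFromState *safecstate)
{
	/*
	 * replay_buffer was allocated in CopyFrom()'s "oldcontext", not in
	 * the subtransaction's context, so the rollback below does not
	 * free it.
	 */
	BeginInternalSubTransaction(NULL);

	PG_TRY();
	{
		/* ... read up to REPLAY_BUFFER_SIZE rows into replay_buffer ... */

		/* Buffer was filled, commit subtransaction and prepare to replay */
		ReleaseCurrentSubTransaction();
		CurrentResourceOwner = safecstate->oldowner;
		safecstate->replay_is_active = true;
	}
	PG_CATCH();
	{
		/* a bad row aborts only the subtransaction; the buffer survives */
		FlushErrorState();
		RollbackAndReleaseCurrentSubTransaction();
		CurrentResourceOwner = safecstate->oldowner;
	}
	PG_END_TRY();
}
```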

BTW, in the v4 patch, data are loaded into the buffer one by one, and
when the buffer fills up, the data in the buffer are also 'replayed'
one by one, right?
Wouldn't this have more overhead than a normal COPY?

As a test, I COPYed a somewhat larger data set with and without the
ignore_errors option.
There may be other factors, but I found a performance difference.

```
=# copy test from '/tmp/10000000.data' ;
COPY 10000000
Time: 6001.325 ms (00:06.001)

=# copy test from '/tmp/10000000.data' with (ignore_errors);
NOTICE: FIND 0 ERRORS
COPY 10000000
Time: 7711.555 ms (00:07.712)
```
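
For reference, that is (7711.555 - 6001.325) / 6001.325, or roughly 28%
overhead in this run.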

>> I feel it might be better to have it as a parameter rather than
>> fixed at 1000.
>
> Yes, I thought about it too. So I created another version with a GUC
> parameter - replay_buffer_size. Attached it. But I think users won't
> need to change replay_buffer_size.
> Also replay_buffer does the same thing as the MultiInsert buffer,
> which is defined by a constant of 1000.

Yeah, when the data being COPYed are not the target of COMMIT, I also
think users won't need to change it.
>
>> NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
>> copyfrom.c.
>> Since safeNextCopyFrom() is an analog of NextCopyFrom(), as commented,
>> would it be natural to put them in the same file?
>
> Sure, corrected it.

Thanks.

>
>>> 188 + safecstate->errors++;
>>> 189 + if (safecstate->errors <= 100)
>>> 190 + ereport(WARNING,
>>> 191 + (errcode(errdata->sqlerrcode),
>>> 192 + errmsg("%s", errdata->context)));
>>
>> It would be better to state in the manual that errors exceeding 100
>> are not displayed.
>> Or, it might be acceptable to output all errors that exceed 100.
>
> It'll be too complicated to create a new parameter just for showing
> the given number of errors. I think 100 is an optimal size.

Yeah, I may have misled you, but I also don't think this needs a new
parameter.
I just thought either of the following would be better:
- State in the documentation that warnings are not output beyond the
first 100 errors.
- Output all the warnings.

>> It would be helpful to add comments about skip_row, etc.
>
> Corrected it.

Thanks.

>
>> WARNING: FIND 0 ERRORS
>> When there were no errors, this WARNING seems unnecessary.
>
> Corrected it.

Thanks.

I applied the v4 patch, and when I canceled the COPY, there was a case
where I found myself left in an aborted transaction.
Should this situation be prevented from occurring?

```
=# copy test from '/tmp/10000000.data' with (ignore_errors );

^CCancel request sent
ERROR: canceling statement due to user request

=# truncate test;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
```
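
If it helps, below is a sketch of one possible direction, assuming the
patch traps per-row errors with PG_CATCH(). The ERRCODE_TO_CATEGORY()
filter is my own idea, not something in the patch: skip only genuine
data errors (SQLSTATE class 22) and re-throw everything else, including
ERRCODE_QUERY_CANCELED, so a cancel cannot leave the outer transaction
silently aborted.

```c
PG_CATCH();
{
	ErrorData  *errdata;

	/* save the error before clearing the error state */
	MemoryContextSwitchTo(oldcontext);
	errdata = CopyErrorData();
	FlushErrorState();

	RollbackAndReleaseCurrentSubTransaction();
	CurrentResourceOwner = safecstate->oldowner;

	/* re-throw anything that is not a data error, e.g. a cancel */
	if (ERRCODE_TO_CATEGORY(errdata->sqlerrcode) != ERRCODE_DATA_EXCEPTION)
		ReThrowError(errdata);

	safecstate->errors++;
	FreeErrorData(errdata);
}
PG_END_TRY();
```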

[1]
https://www.postgresql.org/message-id/1197677930.1536.18.camel%40dell.linuxdev.us.dell.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
