Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, a(dot)lepikhov(at)postgrespro(dot)ru
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2022-09-21 12:11:46
Message-ID: CALH1Lgsr7Asz6r6jnRta7DVEjKwyz=b3XJhyCMoYv8Yy9y4cXg@mail.gmail.com
Lists: pgsql-hackers

Thank you for reviewing.
The previous patch had an error when processing constraints. That is fixed
now, but the code has grown and become more complicated
(0005-COPY_IGNORE_ERRORS). I also simplified the logic of
safeNextCopyFrom().
You asked why we need subtransactions; the answer is in this patch. When
processing a row that does not satisfy constraints or INSTEAD OF triggers,
it is necessary to roll back the subtransaction and return the table to its
original state.
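For illustration, this is roughly the per-row subtransaction pattern
involved (a generic sketch using the usual xact.h/elog.h calls, not the
patch's exact code; resultRelInfo, myslot and estate are assumed to be set
up by the surrounding COPY machinery):
```
/* Remember the state we must restore after the subtransaction ends. */
MemoryContext oldcxt = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;

BeginInternalSubTransaction(NULL);
MemoryContextSwitchTo(oldcxt);

PG_TRY();
{
    /* constraints and INSTEAD OF triggers may raise an error here */
    if (resultRelInfo->ri_RelationDesc->rd_att->constr)
        ExecConstraints(resultRelInfo, myslot, estate);
    /* ... ExecIRInsertTriggers() / table_tuple_insert() ... */

    /* the row is fine: keep its effects */
    ReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcxt);
    CurrentResourceOwner = oldowner;
}
PG_CATCH();
{
    /* bad row: undo everything it did and carry on with the next one */
    MemoryContextSwitchTo(oldcxt);
    FlushErrorState();
    RollbackAndReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcxt);
    CurrentResourceOwner = oldowner;
}
PG_END_TRY();
```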
Because of that complexity, I also prepared a simplified patch
(0006-COPY_IGNORE_ERRORS) that drops constraint and trigger handling and
deals only with errors that occur while reading the file.
I ran both patches against all COPY regression tests and they worked
correctly.

> BTW in v4 patch, data are loaded into the buffer one by one, and when
> the buffer fills up, the data in the buffer are 'replayed' also one by
> one, right?
> Wouldn't this have more overhead than a normal COPY?
>
The data is loaded into the buffer one by one, but in "replay mode"
ignore_errors works like a standard COPY: tuples are added to the slot and,
depending on the insert method (CIM_SINGLE or CIM_MULTI), are replayed
through the corresponding code path.
For the 0006 patch, you can think of the main for(;;) loop as being split
into two parts: the first part adds tuples to the buffer and the second
part inserts them into the table. The two parts do not overlap and run
sequentially.
The main purpose of replay_buffer is the CIM_SINGLE case. If you implement
the CIM_SINGLE case without it, you will see that tuples read before an
error occurs are not added to the table. The logic of the 0005 patch is
similar, with some differences.
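In rough outline, the loop has this two-phase shape (a sketch only:
REPLAY_BUFFER_SIZE, store_in_replay_buffer() and replay_one_row() are
illustrative names, and the memory-context/error-data cleanup a real
implementation needs is omitted):
```
for (;;)
{
    int  nbuffered = 0;
    bool done = false;

    /* Phase 1: read rows into the buffer, skipping rows that fail to parse */
    while (nbuffered < REPLAY_BUFFER_SIZE)
    {
        bool ok = false;
        bool bad_row = false;

        PG_TRY();
        {
            ok = NextCopyFrom(cstate, econtext, values, nulls);
        }
        PG_CATCH();
        {
            /* input error: discard this row and keep reading */
            FlushErrorState();
            bad_row = true;
        }
        PG_END_TRY();

        if (bad_row)
            continue;
        if (!ok)
        {
            done = true;        /* end of input */
            break;
        }
        store_in_replay_buffer(values, nulls, nbuffered++);
    }

    /* Phase 2: insert the buffered rows exactly as a normal COPY would */
    for (int i = 0; i < nbuffered; i++)
        replay_one_row(cstate, estate, myslot, i);  /* CIM_SINGLE or CIM_MULTI path */

    if (done)
        break;
}
```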

> As a test, I COPYed slightly larger data with and without ignore_errors
> option.
> There might be other reasons, but I found a performance difference.

I tried to reduce the performance difference by cleaning up replay_buffer
through a reset of a new memory context dedicated to it, replay_cxt.
```
Before:
Without ignore_errors:
COPY 10000000
Time: 15538,579 ms (00:15,539)
With ignore_errors:
COPY 10000000
Time: 21289,121 ms (00:21,289)

After:
Without ignore_errors:
COPY 10000000
Time: 15318,922 ms (00:15,319)
With ignore_errors:
COPY 10000000
Time: 19868,175 ms (00:19,868)
```
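For clarity, a minimal sketch of that idea (the context name and the exact
placement are assumptions about the patch):
```
/*
 * Dedicated context for the replay buffer: everything buffered for one
 * batch is allocated here and released with a single reset.
 */
MemoryContext replay_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                 "COPY replay buffer",
                                                 ALLOCSET_DEFAULT_SIZES);

MemoryContext oldcxt = MemoryContextSwitchTo(replay_cxt);
/* ... copy the row's values/nulls into the buffer ... */
MemoryContextSwitchTo(oldcxt);

/* after the batch has been replayed, drop all of it at once */
MemoryContextReset(replay_cxt);
```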

> - Put in the documentation that the warnings will not be output for more
> than 101 cases.
>
Yes, I have noted this in the documentation.

> I applied v4 patch and when canceled the COPY, there was a case I found
> myself left in a transaction.
> Should this situation be prevented from occurring?
>
> ```
> =# copy test from '/tmp/10000000.data' with (ignore_errors );
>
> ^CCancel request sent
> ERROR: canceling statement due to user request
>
> =# truncate test;
> ERROR: current transaction is aborted, commands ignored until end of
> transaction block
> ```
>
I tried to reproduce your error but could not. The result was the same as
with standard COPY FROM.

Attachment Content-Type Size
0005-COPY_IGNORE_ERRORS.patch text/x-patch 27.1 KB
0006-COPY_IGNORE_ERRORS.patch text/x-patch 23.7 KB
