Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, Daniel Gustafsson <daniel(at)yesql(dot)se>, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Danil Anisimow <anisimow(dot)d(at)gmail(dot)com>, Nikita Malakhov <HukuToc(at)gmail(dot)com>, a(dot)lepikhov(at)postgrespro(dot)ru
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2023-03-27 13:51:32
Message-ID: 19551e8c2717c24689913083f841ddb5@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023-03-23 02:50, Andres Freund wrote:
Thanks again for your review.
Attached v5 patch.

> Have you measured whether this has negative performance effects when
> *NOT*
> using the new option?

I loaded 10000000 rows of pgbench_accounts on my laptop and compared the
elapsed time.
GUCs changed from the default are logging_collector = on,
log_error_verbosity = verbose.

Three tests were run under each condition and the middle of them is
listed below:

- patch NOT applied(36f40ce2dc66): 35299ms
- patch applied, without IGNORE_DATATYPE_ERRORS: 34409ms
- patch applied, with IGNORE_DATATYPE_ERRORS: 35510ms

It seems there are no significant degradation.

Also tested the elapsed time when loading data which has some datatype
error with IGNORE_DATATYPE_ERRORS:
- data has 100 rows of error: 35269ms
- data has 1000 rows of error: 34577ms
- data has 5000000 rows of error: 48925ms

5000000 rows of error consumes much time, but it seems to be influenced
by logging time.
Here are test results under log_min_messages and client_min_messages are
'error':
- data has 5000000 data type error: 23972ms
- data has 0 data type error: 34320ms

Now conversely, when there were many data type errors, it consumes less
time.
This seems like a reasonable result since the amount of skipped data is
increasing.

> As-is this does not work with FORMAT BINARY - and converting the binary
> input
> functions to support soft errors won't happen for 16. So I think you
> need to
> raise an error if BINARY and IGNORE_DATATYPE_ERRORS are specified.

Added the option check.

> On 2023-03-22 22:34:20 +0900, torikoshia wrote:
> > @@ -985,9 +986,28 @@ CopyFrom(CopyFromState cstate)
> >
> > ExecClearTuple(myslot);
> >
> > + if (cstate->opts.ignore_datatype_errors)
> > + {
> > + escontext.details_wanted = true;
> > + cstate->escontext = escontext;
> > + }
>
> I think it might be worth pulling this out of the loop. That does mean
> you'd
> have to reset escontext.error_occurred after an error, but that doesn't
> seem
> too bad, you need to do other cleanup anyway.

Pull this out of the loop and added process for resetting escontext.

> > @@ -956,10 +957,20 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
> > values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
> > }
> > else
> > - values[m] = InputFunctionCall(&in_functions[m],
> > - string,
> > - typioparams[m],
> > - att->atttypmod);
> > + /* If IGNORE_DATATYPE_ERRORS is enabled skip rows with datatype errors */
> > + if (!InputFunctionCallSafe(&in_functions[m],
> > + string,
> > + typioparams[m],
> > + att->atttypmod,
> > + (Node *) &cstate->escontext,
> > + &values[m]))
> > + {
> > + cstate->ignored_errors++;
> > +
> > + ereport(WARNING,
> > + errmsg("%s", cstate->escontext.error_data->message));
>
> That isn't right - you loose all the details of the message. As is
> you'd also
> leak the error context.
>
> I think the best bet for now is to do something like
> /* adjust elevel so we don't jump out */
> cstate->escontext.error_data->elevel = WARNING;
> /* despite the name, this won't raise an error if elevel < ERROR */
> ThrowErrorData(cstate->escontext.error_data);

As I mentioned in one previous email, added above codes for now.

> I wonder if we ought to provide a wrapper for this? It could e.g. know
> to
> mention the original elevel and such?
>
>
> I don't think NextCopyFrom() is the right place to emit this warning -
> it
> e.g. is also called from file_fdw.c, which might want to do something
> else
> with the error. From a layering POV it seems cleaner to do this in
> CopyFrom(). You already have a check for escontext.error_occurred there
> anyway.

Agreed.

> > @@ -3378,6 +3378,10 @@ copy_opt_item:
> > {
> > $$ = makeDefElem("freeze", (Node *) makeBoolean(true), @1);
> > }
> > + | IGNORE_DATATYPE_ERRORS
> > + {
> > + $$ = makeDefElem("ignore_datatype_errors", (Node *)makeBoolean(true), @1);
> > + }
> > | DELIMITER opt_as Sconst
> > {
> > $$ = makeDefElem("delimiter", (Node *) makeString($3), @1);
>
>
> I think we shouldn't add a new keyword for this, but only support this
> via
> /* new COPY option syntax */
> copy_generic_opt_list:
> copy_generic_opt_elem
>
> Further increasing the size of the grammar with random keywords when we
> have
> more generic ways to represent them seems unnecessary.

Agreed.

> > +-- tests for IGNORE_DATATYPE_ERRORS option
> > +CREATE TABLE check_ign_err (n int, m int[], k int);
> > +COPY check_ign_err FROM STDIN WITH IGNORE_DATATYPE_ERRORS;
> > +1 {1} 1
> > +a {2} 2
> > +3 {3} 3333333333
> > +4 {a, 4} 4
> > +
> > +5 {5} 5
> > +\.
> > +SELECT * FROM check_ign_err;
> > +
>
> I suggest adding a few more tests:
>
> - COPY with a datatype error that can't be handled as a soft error

Added a test for cases with missing columns.
However it's not datatype error and not what you expected, is it?

> - test documenting that COPY FORMAT BINARY is incompatible with
> IGNORE_DATATYPE_ERRORS

Added it.

> - a soft error showing the error context - although that will require
> some
> care to avoid the function name + line in the output

I assume you mean a test to check the server log, but I haven't come up
with a way to do it.
Adding a TAP test might do it, but I think it would be overkill to add
one just for this.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

Attachment Content-Type Size
v5-0001-Add-new-COPY-option-IGNORE_DATATYPE_ERRORS.patch text/x-diff 12.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-03-27 13:55:27 Re: cataloguing NOT NULL constraints
Previous Message parth ratra 2023-03-27 13:42:42 facing issues in downloading of packages in pgadmin4