Re: Small fix on COPY ON_ERROR document

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small fix on COPY ON_ERROR document
Date: 2024-01-26 06:33:22
Message-ID: CAKFQuwZJZ6uaS2B7qpL2FJzWBsnDdzgtbsW3pH9zuT6vC3fH+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 25, 2024 at 10:40 PM Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:

> On Fri, 26 Jan 2024 13:59:09 +0900
> Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
> wrote:
> > >
> > > Hi,
> > >
> > > I found that the documentation of COPY ON_ERROR said
> > > COPY stops operation at the first error when
> > > "ON_ERROR is not specified.", but it also stop when
> > > ON_ERROR is specified to the default value, "stop".
> > >
>
>
I'm finding the current wording surrounding ON_ERROR to not fit in with the
rest of the page. I've tried to improve things by being a bit more
invasive.

This first hunk updates the description of the COPY command to include
describing the purpose of the ON_ERROR clause.

I too am concerned about the word "parsed" here, and "malformed" in the
more detailed descriptions; this would need to be modified to better
reflect the circumstances under which ignore happens.

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..5fe4c9f747 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,14 @@ COPY { <replaceable
class="parameter">table_name</replaceable> [ ( <replaceable
in the <structname>pg_stat_progress_copy</structname> view. See
<xref linkend="copy-progress-reporting"/> for details.
</para>
+
+ <para>
+ By default, <command>COPY</command> will abort if it encounters errors.
+ For non-binary format <command>COPY FROM</command> the user can specify
+ to instead ignore input rows that cannot be parsed by specifying
+ the <literal>ignore</literal> option to the <literal>ON_ERROR</literal>
+ clause.
+ </para>
</refsect1>

<refsect1>

The use of the word "Currently" stood out to me when reading the next
hunk. We always document current reality and don't call out that fact. We
do not imply some future feature may change how this all works. On a
related note, right now we have stop and ignore, which are basically enums
just like we have for format (csv, text, binary). Unlike format, though,
this option requires single quotes. Why is that? I did keep with not
specifying the enum in the main syntax block since format doesn't as well;
though writing { stop | ignore } seemed quite appealing.

The rest of the page indicates what the default is in a sentence by itself,
not as a parenthetical next to the option.

The command limitations seemed worthy of a separate paragraph, though it
repeats the description which isn't great. I'm going to sleep on this one.

@@ -379,17 +387,16 @@ COPY { <replaceable
class="parameter">table_name</replaceable> [ ( <replaceable
<listitem>
<para>
Specifies which <replaceable class="parameter">
- error_action</replaceable> to perform when there is malformed data
in the input.
- Currently, only <literal>stop</literal> (default) and
<literal>ignore</literal>
- values are supported.
- If the <literal>stop</literal> value is specified,
- <command>COPY</command> stops operation at the first error.
- If the <literal>ignore</literal> value is specified,
- <command>COPY</command> skips malformed data and continues copying
data.
- The option is allowed only in <command>COPY FROM</command>.
- Only <literal>stop</literal> value is allowed when
- using <literal>binary</literal> format.
+ error_action</replaceable> to perform when encountering a malformed
input row:
+ <literal>stop</literal> or <literal>ignore</literal>.
+ A value of <literal>stop</literal> means fail the command, while
+ <literal>ignore</literal> means discard the input row and continue
with the next one.
+ The default is <literal>stop</literal>
</para>
+ <para>
+ The ignore option is applicable only for <command>COPY FROM</command>
+ when the <literal>FORMAT</literal> is <literal>text</literal>
+ or <literal>csv</literal>.
+ </para>
</listitem>
</varlistentry>

The following was, IMO, too much text about something not to worry about,
COPY TO and ignore mode. The point is dead tuples on error and running
vacuum. It doesn't really even matter what caused the error, though if the
error was even before rows started to be imported then obviously there
would be no dead tuples.

Oh, and the word "first" really isn't adding anything here that we cannot
reasonably leave to common sense, IMO. We either ignore all errors or stop
on the first one. There isn't a stop mode that is capable of bypassing
errors and the ignore mode doesn't have a "first n" option so one assumes
all errors are ignored.

@@ -576,15 +583,12 @@ COPY <replaceable
class="parameter">count</replaceable>
</para>

<para>
- <command>COPY</command> stops operation at the first error when
- <literal>ON_ERROR</literal> is not specified. This
- should not lead to problems in the event of a <command>COPY
- TO</command>, but the target table will already have received
- earlier rows in a <command>COPY FROM</command>. These rows will not
- be visible or accessible, but they still occupy disk space. This might
- amount to a considerable amount of wasted disk space if the failure
- happened well into a large copy operation. You might wish to invoke
- <command>VACUUM</command> to recover the wasted space.
+ A failed <command>COPY FROM</command> command will have performed
+ physical insertion of all rows prior to the malformed one.
+ While these rows will not be visible or accessible, they still occupy
+ disk space. This might amount to a considerable amount of wasted disk
+ space if the failure happened well into a large copy operation.
+ <command>VACUUM</command> should be used to recover the wasted space.
</para>

<para>

David J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-01-26 06:42:33 Re: Popcount optimization using AVX512
Previous Message Peter Eisentraut 2024-01-26 06:28:15 Re: make dist using git archive