Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: jian he <jian(dot)universality(at)gmail(dot)com>
Cc: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>, zhihuifan1213(at)163(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Gustafsson <daniel(at)yesql(dot)se>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, anisimow(dot)d(at)gmail(dot)com, HukuToc(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2023-12-14 20:48:52
Message-ID: CAD21AoCeEOBN49fu43e6tBTynnswugA3oZ5AZvLeyDCpxpCXPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Dec 12, 2023 at 10:04 PM jian he <jian(dot)universality(at)gmail(dot)com> wrote:
>
> On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina
> <lena(dot)ribackina(at)yandex(dot)ru> wrote:
> >
> > Hi! Thank you for your work. Your patch looks better!
> > Yes, thank you! It works fine, and I see that the regression tests have been passed. 🙂
> > However, when I ran 'copy from with save_error' operation with simple csv files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created it, I described below):
> >
> > postgres=# create table test (x int primary key, y int not null);
> > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x
> > FOREIGN KEY(x)
> > REFERENCES test(x));
> >
> > I did not find a table with saved errors after operation, although I received a log about it:
> >
> > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV save_error
> > NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test_error
> > ERROR: duplicate key value violates unique constraint "test_pkey"
> > DETAIL: Key (x)=(2) already exists.
> > CONTEXT: COPY test, line 3
> >
> > postgres=# select * from public.test_error;
> > ERROR: relation "public.test_error" does not exist
> > LINE 1: select * from public.test_error;
> >
> > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> > NOTICE: 2 rows were skipped because of error. skipped row saved to table public.test1_error
> > ERROR: insert or update on table "test1" violates foreign key constraint "fk_x"
> > DETAIL: Key (x)=(2) is not present in table "test".
> >
> > postgres=# select * from public.test1_error;
> > ERROR: relation "public.test1_error" does not exist
> > LINE 1: select * from public.test1_error;
> >
> > Two lines were written correctly in the csv files, therefore they should have been added to the tables, but they were not added to the tables test and test1.
> >
> > If I leave only the correct rows, everything works fine and the rows are added to the tables.
> >
> > in copy_test.csv:
> >
> > 2,0
> >
> > 1,1
> >
> > in copy_test1.csv:
> >
> > 2,0
> >
> > 2,1
> >
> > 1,1
> >
> > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV
> > COPY 2
> > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV save_error
> > NOTICE: No error happened.Error holding table public.test1_error will be droped
> > COPY 3
> >
> > Maybe I'm launching it the wrong way. If so, let me know about it.
>
> looks like the above is about constraints violation while copying.
> constraints violation while copying not in the scope of this patch.
>
> Since COPY FROM is very like the INSERT command,
> you do want all the valid constraints to check all the copied rows?
>
> but the notice raised by the patch is not right.
> So I place the drop error saving table or raise notice logic above
> `ExecResetTupleTable(estate->es_tupleTable, false)` in the function
> CopyFrom.
>
> >
> > I also notice interesting behavior if the table was previously created by the user. When I was creating an error_table before the 'copy from' operation,
> > I received a message saying that it is impossible to create a table with the same name (it is shown below) during the 'copy from' operation.
> > I think you should add information about this in the documentation, since this seems to be normal behavior to me.
> >
>
> doc changed. you may check it.

I've read this thread and the latest patch. IIUC with SAVE_ERROR
option, COPY FROM creates an error table for the target table and
writes error information there.

While I agree that the final shape of this feature would be something
like that design, I'm concerned some features are missing in order to
make this feature useful in practice. For instance, error logs are
inserted to error tables without bounds, meaning that users who want
to tolerate errors during COPY FROM will have to truncate or drop the
error tables periodically, or the database will grow with error logs
without limit. Ideally such maintenance work should be done by the
database. There might be some users who want to log such conversion
errors in server logs to avoid such maintenance work. I think we
should provide an option for where to write, at least. Also, since the
error tables are normal user tables internally, error logs are also
replicated to subscribers if there is a publication FOR ALL TABLES,
unlike system catalogs. I think some users would not like such
behavior.

Looking at SAVE_ERROR feature closely, I think it consists of two
separate features. That is, it enables COPY FROM to load data while
(1) tolerating errors and (2) logging errors to somewhere (i.e., an
error table). If we implement only (1), it would be like COPY FROM
tolerate errors infinitely and log errors to /dev/null. The user
cannot see the error details but I guess it could still help some
cases as Andres mentioned[1] (it might be a good idea to send the
number of rows successfully loaded in a NOTICE message if some rows
could not be loaded). Then with (2), COPY FROM can log error
information to somewhere such as tables and server logs and the user
can select it. So I'm thinking we may be able to implement this
feature incrementally. The first step would be something like an
option to ignore all errors or an option to specify the maximum number
of errors to tolerate before raising an ERROR. The second step would
be to support logging destinations such as server logs and tables.

Regards,

[1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-12-14 20:53:36 pg_serial bloat
Previous Message Masahiko Sawada 2023-12-14 20:19:43 Re: Make COPY format extendable: Extract COPY TO format implementations