Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2021-12-19 05:09:41
Message-ID: CAFj8pRC96C+C6Gg8OHkv26YZMQsrRJ9YtJhBKrrrje1hc2yrcw@mail.gmail.com
Lists: pgsql-hackers

Hi

On Sat, Dec 18, 2021 at 9:55 AM, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
wrote:

> Hello.
>
> I wrote a patch implementing COPY that ignores errors in rows, using
> block subtransactions.
>

It is great that you are working on this patch. Unfortunately, I am afraid
this simple design is not optimal. Using a subtransaction for every row has
too much overhead. I think it should use one subtransaction per block of
rows (say 1000 rows), and only when there is an exception should it replay
the inserts with a subtransaction per row. You should check the performance
overhead.
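
To illustrate, here is a rough sketch of that two-level scheme (not tested
code: BeginInternalSubTransaction and friends are the real xact.h API, but
read_next_batch, insert_batch and insert_one_row are hypothetical
placeholders, and memory-context save/restore is elided):

    #define BATCH_SIZE 1000

    for (;;)
    {
        /* rows: buffer of parsed input rows (placeholder) */
        int nrows = read_next_batch(rows, BATCH_SIZE);

        if (nrows == 0)
            break;

        BeginInternalSubTransaction(NULL);
        PG_TRY();
        {
            insert_batch(rows, nrows);      /* whole block at once */
            ReleaseCurrentSubTransaction(); /* block succeeded */
        }
        PG_CATCH();
        {
            /* block failed: roll it back and replay row by row */
            RollbackAndReleaseCurrentSubTransaction();
            FlushErrorState();

            for (int i = 0; i < nrows; i++)
            {
                BeginInternalSubTransaction(NULL);
                PG_TRY();
                {
                    insert_one_row(&rows[i]);
                    ReleaseCurrentSubTransaction();
                }
                PG_CATCH();
                {
                    /* only the bad row is rolled back and skipped */
                    RollbackAndReleaseCurrentSubTransaction();
                    FlushErrorState();
                }
                PG_END_TRY();
            }
        }
        PG_END_TRY();
    }

This way the per-subtransaction cost is paid once per block on clean data,
and once per row only inside a failing block.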

Regards

Pavel

> Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERRORS;
>
> Examples:
> CREATE TABLE check_ign_err (n int, m int, k int);
> COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
> 1 1 1
> 2 2 2 2
> 3 3 3
> \.
> WARNING: COPY check_ign_err, line 2: "2 2 2 2"
> SELECT * FROM check_ign_err;
> n | m | k
> ---+---+---
> 1 | 1 | 1
> 3 | 3 | 3
> (2 rows)
>
> ##################################################
>
> TRUNCATE check_ign_err;
> COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
> 1 1 1
> 2 2
> 3 3 3
> \.
> WARNING: COPY check_ign_err, line 2: "2 2"
> SELECT * FROM check_ign_err;
> n | m | k
> ---+---+---
> 1 | 1 | 1
> 3 | 3 | 3
> (2 rows)
>
> ##################################################
>
> TRUNCATE check_ign_err;
> COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
> 1 1 1
> 2 a 2
> 3 3 3
> \.
> WARNING: COPY check_ign_err, line 2, column m: "a"
> SELECT * FROM check_ign_err;
> n | m | k
> ---+---+---
> 1 | 1 | 1
> 3 | 3 | 3
> (2 rows)
>
>
>
> Regards, Damir
>
> On Fri, Dec 10, 2021 at 9:48 PM, Pavel Stehule
> <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
>>
>>
>> 2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>>
>>>
>>>
>>> 2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash(at)commandprompt(dot)com>:
>>>
>>>> Trent Shipley <trent_shipley(at)qwest(dot)net> writes:
>>>>
>>>> > On Friday 2007-12-14 16:22, Tom Lane wrote:
>>>> >> Neil Conway <neilc(at)samurai(dot)com> writes:
>>>> >> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct
>>>> >> > COPY to drop (and log) rows that contain malformed data. That is,
>>>> >> > rows with too many or too few columns, rows that result in
>>>> >> > constraint violations, and rows containing columns where the data
>>>> >> > type's input function raises an error. The last case is the only
>>>> >> > thing that would be a bit tricky to implement, I think: you could
>>>> >> > use PG_TRY() around the InputFunctionCall, but I guess you'd need
>>>> >> > a subtransaction to ensure that you reset your state correctly
>>>> >> > after catching an error.
>>>> >>
>>>> >> Yeah. It's the subtransaction per row that's daunting --- not only
>>>> >> the cycles spent for that, but the ensuing limitation to 4G rows
>>>> >> imported per COPY.
>>>> >
>>>> > You could extend the COPY FROM syntax with a COMMIT EVERY n clause.
>>>> > This would help with the 4G subtransaction limit. The cost to the
>>>> > ETL process is that a simple rollback would not be guaranteed to send
>>>> > the process back to its initial state. There are easy ways to deal
>>>> > with the rollback issue though.
>>>> >
>>>> > A {NO} RETRY {USING algorithm} clause might be useful. If the NO
>>>> > RETRY option is selected then the COPY FROM can run without
>>>> > subtransactions and in excess of the 4G per transaction limit. NO
>>>> > RETRY should be the default since it preserves the legacy behavior
>>>> > of COPY FROM.
>>>> >
>>>> > You could have an EXCEPTIONS TO {filename|STDERR} clause. I would
>>>> > not give the option of sending exceptions to a table since they are
>>>> > presumably malformed, otherwise they would not be exceptions. (Users
>>>> > should re-process exception files if they want an "if good then
>>>> > table a, else exception to table b ..." flow.)
>>>> >
>>>> > EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>>>> >
>>>> >
>>>> >> If we could somehow only do a subtransaction per failure, things
>>>> >> would be much better, but I don't see how.
>>>>
>>>> Hello,
>>>>
>>>> Attached is a proof of concept patch for this TODO item. There are no
>>>> docs yet; I just wanted to know if the approach is sane.
>>>>
>>>> The added syntax is like the following:
>>>>
>>>> COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
>>>>
>>>> The way it's done is by abusing Copy Both mode, and from my limited
>>>> testing, that seems to just work. The error trapping itself is done
>>>> using PG_TRY/PG_CATCH and can only catch formatting or before-insert
>>>> trigger errors; no attempt is made to recover from a failed unique
>>>> constraint, etc.
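>>>>
>>>> Schematically, the trapping is shaped like this (a hand sketch rather
>>>> than the patch itself; PG_TRY/PG_CATCH, InputFunctionCall and
>>>> FlushErrorState are the real backend API, while emit_exception_line()
>>>> is a hypothetical helper standing in for the Copy Both output):
>>>>
>>>>     PG_TRY();
>>>>     {
>>>>         /* parse one field with the column type's input function */
>>>>         values[i] = InputFunctionCall(&in_functions[i], field,
>>>>                                       typioparams[i], typmods[i]);
>>>>     }
>>>>     PG_CATCH();
>>>>     {
>>>>         /* reset the error state and ship the raw input line to the
>>>>            EXCEPTIONS TO destination instead of aborting the COPY */
>>>>         FlushErrorState();
>>>>         emit_exception_line(cstate);    /* hypothetical */
>>>>     }
>>>>     PG_END_TRY();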
>>>>
>>>> Example in action:
>>>>
>>>> postgres=# \d test_copy2
>>>> Table "public.test_copy2"
>>>> Column | Type | Modifiers
>>>> --------+---------+-----------
>>>> id | integer |
>>>> val | integer |
>>>>
>>>> postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
>>>> 1
>>>> NOTICE: missing data for column "val"
>>>> CONTEXT: COPY test_copy2, line 1: "1"
>>>> 2
>>>> NOTICE: missing data for column "val"
>>>> CONTEXT: COPY test_copy2, line 2: "2"
>>>> 3
>>>> NOTICE: missing data for column "val"
>>>> CONTEXT: COPY test_copy2, line 3: "3"
>>>> NOTICE: total exceptions ignored: 3
>>>>
>>>> postgres=# \d test_copy1
>>>> Table "public.test_copy1"
>>>> Column | Type | Modifiers
>>>> --------+---------+-----------
>>>> id | integer | not null
>>>>
>>>> postgres=# set client_min_messages to warning;
>>>> SET
>>>> postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
>>>> ...
>>>> vmstat
>>>> zoneinfo
>>>> postgres=#
>>>>
>>>> Limited performance testing shows no significant difference between
>>>> the error-catching and plain code paths. For example, timing
>>>>
>>>> copy test_copy1 from program 'seq 1000000' [exceptions to stdout]
>>>>
>>>> shows similar numbers with or without the added "exceptions to" clause.
>>>>
>>>> Now that I'm sending this I wonder if the original comment about the
>>>> need for a subtransaction around every loaded line still holds. Any
>>>> example of what would not be properly rolled back by just PG_TRY?
>>>>
>>>
>>> This method is unsafe: exception handlers don't usually free memory or
>>> release resources, so there is a risk of memory and resource leaks. In
>>> PostgreSQL, error cleanup is done by (sub)transaction abort, not by
>>> PG_CATCH itself.
>>>
>>> You can get the same performance with block subtransactions: when you
>>> use one subtransaction per 1000 rows, the impact of subtransactions is
>>> minimal.
>>>
>>> When a block fails, you can fall back to row-level subtransactions; this
>>> works well when you expect mostly correct data.
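>>>
>>> (A rough sanity check on the numbers, my arithmetic rather than anything
>>> from the thread: the 4G ceiling comes from 32-bit transaction IDs, 2^32
>>> being about 4.3 billion. With one subtransaction per 1000-row block,
>>> clean data consumes 1000x fewer XIDs, so the ceiling moves to roughly
>>> 4 * 10^12 rows; only a failing block pays one extra XID per replayed
>>> row.)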
>>>
>>
>> Two years ago I wrote an extension that did this, but I have not had
>> time to finish it and push it upstream.
>>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>>
>>>>
>>>> Happy hacking!
>>>> --
>>>> Alex
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
