Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>
To:
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2021-12-18 08:55:24
Message-ID: CALH1LgsKDk4XthrM4m8tAqfoPyNgDCSFV3Go1nevXze-8dwKrA@mail.gmail.com
Lists: pgsql-hackers

Hello.

I wrote a patch implementing COPY FROM with the ability to ignore erroneous
rows, using block subtransactions.

Syntax: COPY [table] FROM [file/stdin] WITH IGNORE_ERRORS;

Examples:
CREATE TABLE check_ign_err (n int, m int, k int);
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2 2 2
3 3 3
\.
WARNING: COPY check_ign_err, line 2: "2 2 2 2"
SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 2
3 3 3
\.
WARNING: COPY check_ign_err, line 2: "2 2"
SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)

##################################################

TRUNCATE check_ign_err;
COPY check_ign_err FROM STDIN WITH IGNORE_ERRORS;
1 1 1
2 a 2
3 3 3
\.
WARNING: COPY check_ign_err, line 2, column m: "a"
SELECT * FROM check_ign_err;
n | m | k
---+---+---
1 | 1 | 1
3 | 3 | 3
(2 rows)
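The block-subtransaction strategy behind the patch (suggested by Pavel in the
quoted discussion below) can be simulated outside the server: insert rows in
batches, each batch acting as one subtransaction; when a batch fails, roll it
back and retry its rows one at a time so only the genuinely bad rows are
skipped. The sketch below is a minimal Python simulation, not the patch's C
code; the batch size of 1000 follows Pavel's suggestion, and `parse_row` and
the in-memory "table" are illustrative stand-ins.

```python
# Simulation of COPY ... IGNORE_ERRORS with block subtransactions.
# A batch plays the role of one subtransaction: it either "commits"
# wholesale or is "rolled back" and retried row by row.

BATCH_SIZE = 1000  # per Pavel's "subtransaction for 1000 rows" suggestion


def parse_row(line, ncols=3):
    """Stand-in for COPY's input parsing: require exactly ncols integers."""
    fields = line.split()
    if len(fields) != ncols:
        raise ValueError(f"expected {ncols} columns, got {len(fields)}")
    return [int(f) for f in fields]  # also raises ValueError on bad data


def copy_ignore_errors(lines):
    table, skipped = [], []
    for start in range(0, len(lines), BATCH_SIZE):
        batch = lines[start:start + BATCH_SIZE]
        try:
            # "Block subtransaction": parse the whole batch, then commit
            # it at once. Building the list first means a failure leaves
            # the table untouched, like a rolled-back subtransaction.
            parsed = [parse_row(line) for line in batch]
            table.extend(parsed)
        except ValueError:
            # Batch failed: fall back to a per-row "subtransaction".
            for offset, line in enumerate(batch):
                try:
                    table.append(parse_row(line))
                except ValueError:
                    # Corresponds to: WARNING: COPY ..., line N: "..."
                    skipped.append((start + offset + 1, line))
    return table, skipped


table, skipped = copy_ignore_errors(["1 1 1", "2 2 2 2", "3 3 3"])
# table   -> [[1, 1, 1], [3, 3, 3]]
# skipped -> [(2, "2 2 2 2")]
```

When the data are almost all correct, nearly every batch commits on the first
try, so the per-row subtransaction cost is paid only for the rare bad batch.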

Regards, Damir

On Fri, Dec 10, 2021 at 21:48 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:

>
>
> 2014-12-26 11:41 GMT+01:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>
>>
>>
>> 2014-12-25 22:23 GMT+01:00 Alex Shulgin <ash(at)commandprompt(dot)com>:
>>
>>> Trent Shipley <trent_shipley(at)qwest(dot)net> writes:
>>>
>>> > On Friday 2007-12-14 16:22, Tom Lane wrote:
>>> >> Neil Conway <neilc(at)samurai(dot)com> writes:
>>> >> > By modifying COPY: COPY IGNORE ERRORS or some such would instruct
>>> >> > COPY to drop (and log) rows that contain malformed data. That is,
>>> >> > rows with too many or too few columns, rows that result in
>>> >> > constraint violations, and rows containing columns where the data
>>> >> > type's input function raises an error. The last case is the only
>>> >> > thing that would be a bit tricky to implement, I think: you could
>>> >> > use PG_TRY() around the InputFunctionCall, but I guess you'd need
>>> >> > a subtransaction to ensure that you reset your state correctly
>>> >> > after catching an error.
>>> >>
>>> >> Yeah. It's the subtransaction per row that's daunting --- not only
>>> >> the cycles spent for that, but the ensuing limitation to 4G rows
>>> >> imported per COPY.
>>> >
>>> > You could extend the COPY FROM syntax with a COMMIT EVERY n clause.
>>> > This would help with the 4G subtransaction limit. The cost to the
>>> > ETL process is that a simple rollback would not be guaranteed to
>>> > send the process back to its initial state. There are easy ways to
>>> > deal with the rollback issue though.
>>> >
>>> > A {NO} RETRY {USING algorithm} clause might be useful. If the NO
>>> > RETRY option is selected then the COPY FROM can run without
>>> > subtransactions and in excess of the 4G per transaction limit. NO
>>> > RETRY should be the default since it preserves the legacy behavior
>>> > of COPY FROM.
>>> >
>>> > You could have an EXCEPTIONS TO {filename|STDERR} clause. I would
>>> > not give the option of sending exceptions to a table since they are
>>> > presumably malformed, otherwise they would not be exceptions.
>>> > (Users should re-process exception files if they want "if good then
>>> > table a else exception to table b ...")
>>> >
>>> > EXCEPTIONS TO and NO RETRY would be mutually exclusive.
>>> >
>>> >
>>> >> If we could somehow only do a subtransaction per failure, things
>>> >> would be much better, but I don't see how.
>>>
>>> Hello,
>>>
>>> Attached is a proof-of-concept patch for this TODO item. There are no
>>> docs yet; I just wanted to know if the approach is sane.
>>>
>>> The added syntax is like the following:
>>>
>>> COPY [table] FROM [file/program/stdin] EXCEPTIONS TO [file or stdout]
>>>
>>> The way it's done is by abusing Copy Both mode, and from my limited
>>> testing, that seems to just work. The error trapping itself is done
>>> using PG_TRY/PG_CATCH and can only catch formatting or before-insert
>>> trigger errors; no attempt is made to recover from a failed unique
>>> constraint, etc.
>>>
>>> Example in action:
>>>
>>> postgres=# \d test_copy2
>>> Table "public.test_copy2"
>>> Column | Type | Modifiers
>>> --------+---------+-----------
>>> id | integer |
>>> val | integer |
>>>
>>> postgres=# copy test_copy2 from program 'seq 3' exceptions to stdout;
>>> 1
>>> NOTICE: missing data for column "val"
>>> CONTEXT: COPY test_copy2, line 1: "1"
>>> 2
>>> NOTICE: missing data for column "val"
>>> CONTEXT: COPY test_copy2, line 2: "2"
>>> 3
>>> NOTICE: missing data for column "val"
>>> CONTEXT: COPY test_copy2, line 3: "3"
>>> NOTICE: total exceptions ignored: 3
>>>
>>> postgres=# \d test_copy1
>>> Table "public.test_copy1"
>>> Column | Type | Modifiers
>>> --------+---------+-----------
>>> id | integer | not null
>>>
>>> postgres=# set client_min_messages to warning;
>>> SET
>>> postgres=# copy test_copy1 from program 'ls /proc' exceptions to stdout;
>>> ...
>>> vmstat
>>> zoneinfo
>>> postgres=#
>>>
>>> Limited performance testing shows no significant difference between the
>>> error-catching and plain code paths. For example, timing
>>>
>>> copy test_copy1 from program 'seq 1000000' [exceptions to stdout]
>>>
>>> shows similar numbers with or without the added "exceptions to" clause.
>>>
>>> Now that I'm sending this I wonder if the original comment about the
>>> need for a subtransaction around every loaded line still holds. Any
>>> example of what would not be properly rolled back by just PG_TRY?
>>>
>>
>> this method is unsafe .. exception handlers don't usually free memory -
>> there is a risk of memory leaks and resource leaks
>>
>> you can get the same performance with block subtransactions - when you
>> use a subtransaction per 1000 rows, the impact of subtransactions is
>> minimal
>>
>> when a block fails, you can fall back to row-level subtransactions - this
>> works well when you expect mostly correct data.
>>
>
> Two years ago I wrote an extension that did this - but I have not had
> time to finish it and push it upstream.
>
> Regards
>
> Pavel
>
>
>>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>> Happy hacking!
>>> --
>>> Alex
>>>
>>>
>>>
>>> --
>>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-hackers
>>>
>>>
>>
>
>

Attachment Content-Type Size
0001-COPY-IGNORE_ERRORS-option.patch text/x-patch 35.3 KB
