Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Damir <dam(dot)bel07(at)gmail(dot)com>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, anisimow(dot)d(at)gmail(dot)com, HukuToc(at)gmail(dot)com, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2023-11-08 20:12:16
Message-ID: 752672.1699474336@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Daniel Gustafsson <daniel(at)yesql(dot)se> writes:
>> On 8 Nov 2023, at 19:18, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I think an actually usable feature of this sort would involve
>> copying all the failed lines to some alternate output medium,
>> perhaps a second table with a TEXT column to receive the original
>> data line. (Or maybe an array of text that could receive the
>> broken-down field values?) Maybe we could dump the message info,
>> line number, field name etc into additional columns.

> I agree that the errors should be easily visible to the user in some way. The
> feature is for sure interesting, especially in data warehouse type jobs where
> dirty data is often ingested.

I agree it's interesting, but we need to get it right the first time.

Here is a very straw-man-level sketch of what I think might work.
The option to COPY FROM looks something like

ERRORS TO other_table_name (item [, item [, ...]])

where the "items" are keywords identifying the information item
we will insert into each successive column of the target table.
This design allows the user to decide which items are of use
to them. I envision items like

LINENO bigint COPY line number, counting from 1
LINE text raw text of line (after encoding conversion)
FIELDS text[] separated, de-escaped string fields (the data
that was or would be fed to input functions)
FIELD text name of troublesome field, if field-specific
MESSAGE text error message text
DETAIL text error message detail, if any
SQLSTATE text error SQLSTATE code

Some of these would have to be populated as NULL if we didn't get
that far in processing the line. In the worst case, which is
encoding conversion failure, I think we couldn't populate any of
the data items except LINENO.

Not sure if we need to insist that the target table columns be
exactly the data types I show above. It'd be nice to allow
the LINENO target to be plain int, perhaps. OTOH, do we really
want to have to deal with issues like conversion failures while
trying to report an error?

> As a data point, Greenplum has this feature with additional SQL syntax to
> control it:
> COPY .. LOG ERRORS SEGMENT REJECT LIMIT xyz ROWS;
> LOG ERRORS instructs the database to log the faulty rows and SEGMENT REJECT
> LIMIT xyz ROWS sets the limit of how many rows can be faulty before the
> operation errors out. I'm not at all advocating that we should mimic this,
> just wanted to add a reference to postgres derivative where this has been
> implemented.

Hm. A "reject limit" might be a useful add-on, but I wouldn't advocate
including it in the initial patch.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2023-11-08 20:49:59 Re: XID-wraparound hazards in LISTEN/NOTIFY
Previous Message Tom Lane 2023-11-08 19:58:14 Re: XX000: tuple concurrently deleted during DROP STATISTICS