Re: Conflict handling for COPY FROM

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Surafel Temesgen <surafel3000(at)gmail(dot)com>
Cc: "asaba(dot)takanori(at)fujitsu(dot)com" <asaba(dot)takanori(at)fujitsu(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Nowocien <anowocien(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Conflict handling for COPY FROM
Date: 2020-03-27 22:27:32
Message-ID: 28444.1585348052@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Surafel Temesgen <surafel3000(at)gmail(dot)com> writes:
> [ conflict-handling-copy-from-v16.patch ]

I took a quick look at this patch, since it was marked "ready for
committer", but I don't see how it can possibly be considered committable.

1. Covering only the errors that are thrown in DoCopy itself doesn't
seem to me to pass the smell test. Yes, I'm sure there's some set of
use-cases for which that'd be helpful, but I think most people would
expect a "skip errors" option to be able to handle cases like malformed
numbers or bad encoding. I understand the implementation reasons that
make it impractical to cover other errors, but do we really want a
feature that most people will see as much less than half-baked? I fear
it'll be an embarrassment.

2. If I'm reading the patch correctly, (some) rejected rows are actually
sent back to the client. This is a wire protocol break of the first
magnitude, and can NOT be accepted. At least not without some provisions
for not doing it with a client that isn't prepared for it. I also am
fairly worried about the possibilities for deadlock (ie, both ends stuck
waiting for the other one to absorb data) if the return traffic volume is
high enough.

3. I don't think enough thought has been put into the reporting, either.

+ ereport(WARNING,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("skipping \"%s\" --- extra data after last expected column",
+ cstate->line_buf.data)));

That's not going to be terribly helpful if the input line runs to many
megabytes. Or even if no individual line is very long, but you get
millions of such warnings. It's pretty much at variance with our
message style guidelines (among other things, those caution you to keep
the primary error message short); and it's randomly different from
COPY's existing practice, which is to show the faulty line as CONTEXT.
Plus it seems plenty weird that some errors are reported this way while
others are reported by sending back the bad tuple (with, it looks like,
no mention of what the specific problem is ... what if you have a lot of
unique indexes?).

BTW, while I don't know much about the ON CONFLICT (speculative
insertion) infrastructure, I wonder how well it really works to
not specify an arbiter index. I see that you're getting away with
it in a trivial test case that has exactly one index, but that's
not stressing the point very hard.

On the whole, I feel like adding this sort of functionality to COPY
itself is a dead end. COPY is meant for fast bulk transfer and not
much else; trying to load more functionality onto it can only end
in serving multiple masters poorly. What we normally recommend
if you have data that needs to be cleaned is to import it into a
permissively-defined staging table (eg, with all columns declared
as text) and then transfer cleaned data to your tables-of-record.
Looking at this patch in terms of whether the functionality is
available in that approach, it seems like you might want two parts
of it:

1. A COPY option to be flexible about the number of columns in the
input, say by filling omitted columns with NULLs.

2. INSERT ... ON CONFLICT can be used to transfer data to permanent
tables with rejection of duplicate keys, but it doesn't have much
flexibility about just what to do with duplicates. Maybe we could
add more ON CONFLICT actions? Sending conflicted rows to some other
table, or updating the source table to show which rows were copied
and which not, might be useful things to think about.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2020-03-27 22:32:55 Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
Previous Message David Steele 2020-03-27 22:26:32 Re: pgbench - refactor init functions with buffers