Re: Add header support to text format and matching feature

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Rémi Lapeyre <remi(dot)lapeyre(at)lenstra(dot)fr>, Zhihong Yu <zyu(at)yugabyte(dot)com>, David Steele <david(at)pgmasters(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: Add header support to text format and matching feature
Date: 2022-06-07 15:47:44
Message-ID: 20220607154744.vvmitnqhyxrne5ms@jrouhaud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Mar 30, 2022 at 09:11:09AM +0200, Peter Eisentraut wrote:
>
> Committed, after some further refinements as discussed.

While working on nearby code, I found some problems with this feature.

First, probably nitpicking, the HEADER MATCH is allowed for COPY TO, is that
expected? The documentation isn't really explicit about it, but there's
nothing to match when exporting data it's a bit surprising. I'm not opposed to
have HEADER MATCH means HEADER ON for COPY TO, as as-is one can easily reuse
the commands history, but maybe it should be clearly documented?

Then, apparently HEADER MATCH doesn't let you do sanity checks against a custom
column list. This one looks like a clear oversight, as something like that
should be entirely valid IMHO:

CREATE TABLE tbl(col1 int, col2 int);
COPY tbl (col2, col1) TO '/path/to/file' WITH (HEADER MATCH);
COPY tbl (col2, col1) FROM '/path/to/file' WITH (HEADER MATCH);

but right now it errors out with:

ERROR: column name mismatch in header line field 1: got "col1", expected "col2"

Note that the error message is bogus if you specify attributes in a
different order from the relation, as the code is mixing access to the tuple
desc and access to the raw fields with the same offset.

This also means that it will actually fail to detect a mismatch in the provided
column list and let you import data in the wrong position as long as the
datatypes are compatible and the column header in the file are in the correct
order. For instance:

CREATE TABLE abc (a text, b text, c text);
INSERT INTO abc SELECT 'a', 'b', 'c';
COPY abc TO '/path/to/file' WITH (HEADER MATCH);

You can then import the data with any of those:
COPY abc(c, b, a) TO '/path/to/file' WITH (HEADER MATCH);
COPY abc(c, a, b) TO '/path/to/file' WITH (HEADER MATCH);
[...]
SELECT * FROM abc;

Even worse, if you try to do a COPY ... FROM ... WITH (HEADER ON) on a table
that has some dropped attribute(s). The current code will access random memory
as there's no exact attnum / raw field mapping anymore.

I can work on a fix if needed (with some additional regression test to cover
those cases), but I'm still not sure that having a user provided column list is
supposed to be accepted or not for the HEADER MATCH. In the meantime I will
add an open item.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-06-07 16:26:01 Re: pgcon unconference / impact of block size on performance
Previous Message Rod Taylor 2022-06-07 15:47:29 Re: Collation version tracking for macOS