Re: Add header support to text format and matching feature

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Rémi Lapeyre" <remi(dot)lapeyre(at)lenstra(dot)fr>
Cc: "vignesh C" <vignesh21(at)gmail(dot)com>,"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add header support to text format and matching feature
Date: 2020-07-25 13:34:22
Message-ID: 80a9b594-01d6-4ee4-a612-9ae9fd3c1194@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Rémi Lapeyre wrote:

> Here’s a new version that fix all the issues.

Here's a review of v4.

The patch improves COPY in two ways:

- COPY TO and COPY FROM now accept "HEADER ON" for the TEXT format
(previously it was only for CSV)

- COPY FROM also accepts "HEADER match" to tell that there's a header
and that its contents must match the columns of the destination table.
This works for both the CSV and TEXT formats. The syntax for the
columns is the same as for the data and the match is case-sensitive.

The first improvement when submitted alone (in 2018 by Simon Muller)
has been judged not useful enough or even hazardous without any
"match" feature. It was returned with feedback in 2018-10 and
resubmitted by Rémi in 2020-02 with the match feature.

The patches apply cleanly, "make check" and "make check-world" pass.

In my tests it works fine except for one crash that I can reproduce
on a fresh build and default configuration with:

$ cat >file.txt
i
1

$ psql
postgres=# create table x(i int);
CREATE TABLE
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
COPY 1
postgres=# \copy x(i) from file.txt with (header match)
PANIC: ERRORDATA_STACK_SIZE exceeded
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I suspect the reason is the way that PG_TRY/PG_CATCH is used, see below.

Code comments:

+/*
+ * Represents whether the header must be absent, present or present and
match.
+ */
+typedef enum CopyHeader
+{
+ COPY_HEADER_ABSENT,
+ COPY_HEADER_PRESENT,
+ COPY_HEADER_MATCH
+} CopyHeader;
+
/*
* This struct contains all the state variables used throughout a COPY
* operation. For simplicity, we use the same struct for all variants of
COPY,
@@ -136,7 +146,7 @@ typedef struct CopyStateData
bool binary; /* binary format? */
bool freeze; /* freeze rows on loading? */
bool csv_mode; /* Comma Separated Value
format? */
- bool header_line; /* CSV or text header line? */
+ CopyHeader header_line; /* CSV or text header line? */

After the redefinition into this enum type, there are still a
bunch of references to header_line that treat it like a boolean:

1190: if (cstate->header_line)
1398: if (cstate->binary && cstate->header_line)
2119: if (cstate->header_line)
3635: if (cstate->cur_lineno == 0 && cstate->header_line)

It works fine since COPY_HEADER_ABSENT is 0 as the first value of the enum,
but maybe it's not good style to count on that.

+ PG_TRY();
+ {
+ if (defGetBoolean(defel))
+ cstate->header_line =
COPY_HEADER_PRESENT;
+ else
+ cstate->header_line =
COPY_HEADER_ABSENT;
+ }
+ PG_CATCH();
+ {
+ char *sval = defGetString(defel);
+
+ if (!cstate->is_copy_from)
+ PG_RE_THROW();
+
+ if (pg_strcasecmp(sval, "match") == 0)
+ cstate->header_line =
COPY_HEADER_MATCH;
+ else
+ ereport(ERROR,
+
(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("header requires a
boolean or \"match\"")));
+ }
+ PG_END_TRY();

It seems wrong to use a PG_CATCH block for this. I understand that
it's because defGetBoolean() calls ereport() on non-booleans, but then
it should be split into an error-throwing function and a
non-error-throwing lexical analysis of the boolean, the above code
calling the latter.
Besides the comments in elog.h above PG_TRY say that
"the error recovery code
can either do PG_RE_THROW to propagate the error outwards, or do a
(sub)transaction abort. Failure to do so may leave the system in an
inconsistent state for further processing."
Maybe this is what happens with the repeated uses of "match"
eventually failing with ERRORDATA_STACK_SIZE exceeded.

- HEADER [ <replaceable class="parameter">boolean</replaceable> ]
+ HEADER { <literal>match</literal> | <literal>true</literal> |
<literal>false</literal> }

This should be enclosed in square brackets because HEADER
with no argument is still accepted.

+ names from the table. On input, the first line is discarded when set
+ to <literal>true</literal> or required to match the column names if
set

The elision of "header" as the subject might be misinterpreted as if
it's the first line that is true. I'd suggest
"when <literal>header>/literal> is set to ..." to avoid any confusion.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-07-25 15:11:59 Re: INSERT INTO SELECT, Why Parallelism is not selected?
Previous Message vignesh C 2020-07-25 13:26:31 Re: proposal: possibility to read dumped table's name from file