Re: COPY formatting

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karel Zak <zakkr(at)zf(dot)jcu(dot)cz>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY formatting
Date: 2004-03-18 15:16:36
Message-ID: 9056.1079622996@sss.pgh.pa.us
Lists: pgsql-hackers

Karel Zak <zakkr(at)zf(dot)jcu(dot)cz> writes:
>> On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
>>> Karel Zak <zakkr(at)zf(dot)jcu(dot)cz> writes:
>>>> This seems like it could only reasonably be implemented as a C function.
>>
>> Why? I said it's pseudocode. It should use the standard fmgr API like
>> every other PostgreSQL function, or is that a problem and I am
>> overlooking something? It must support arbitrary programming
>> languages, not only C.

Sure, but the question is whether the *stuff it has to do* can
reasonably be coded in anything but C. Why are you passing in a
relation OID, if not for lookups in relcache entries that are simply
not accessible above the C level? (Don't tell me you want the function
to do a bunch of actual SELECTs from system catalogs for every line
of the copy...)

Passing in a relation OID is probably a bad idea anyway, as it ties this
API to the assumption that COPY is only for complete relations. There's
been talk before of allowing a SELECT result to be presented via the
COPY protocol, for instance. What might be a more usable API is

	COPY OUT:
		function formatter_out(text[]) returns text
	COPY IN:
		function formatter_in(text) returns text[]

where the text array is either the results of or the input to the
per-column datatype I/O routines. This makes it explicit that the
formatter's job is solely to determine the column-level wrapping and
unwrapping of the data. I'm assuming here that there is no good reason
for the formatter to care about the specific datatypes involved; can you
give a counterexample?
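
To make the shape of that pair concrete, here is a minimal sketch in Python rather than in any real PostgreSQL procedural language. It assumes a CSV-like format in which every column is double-quoted; the quoting rule and the lack of NULL handling are illustrative assumptions, not part of the proposal. The point is only that both functions see text values and never the column datatypes.

```python
import csv

def formatter_out(columns):
    """text[] -> text: wrap already-formatted column values into one line.

    Assumption for this sketch: every value is double-quoted and embedded
    quotes are doubled. The line terminator is left to the caller.
    """
    quoted = ['"' + c.replace('"', '""') + '"' for c in columns]
    return ",".join(quoted)

def formatter_in(line):
    """text -> text[]: unwrap one line back into column values.

    The values are still text; the per-column datatype input routines
    would run on them afterwards.
    """
    return next(csv.reader([line]))
```

For example, formatter_out(["1", "a,b"]) gives "1","a,b" and formatter_in turns that line back into the two original strings.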

> It's a pity that the main idea of the current COPY is based on separate
> lines and that it is not a more general interface for streaming data
> between FE and BE.

Yeah, that was another concern I had. This API would let the formatter
control line-level layout but it would not eliminate the hard-wired
significance of newline. What's worse, there isn't any clean way to
deal with reading quoted newlines --- the formatter can't really replace
the default quoting rules if the low-level code is going to decide
whether a newline is quoted or not.

We could possibly solve that by specifying that the text output or input
(respectively) is the complete line sent to or from the client,
including newline or whatever other line-level formatting you are using.
This still leaves the problem of how the low-level COPY IN code knows
what is a complete line to pass off to the formatter_in routine. We
could possibly fix this by adding a second input-control routine

	function formatter_linelength(text) returns integer

which is defined to return -1 if the input isn't a complete line yet
(i.e., read some more data, append to the buffer, and try again), or
>= 0 to indicate that the first N bytes of the buffer represent a
complete line to be passed off to formatter_in. I don't see a way to
combine formatter_in and formatter_linelength into a single function
without relying on "out" parameters, which would again confine the
feature to format functions written in C.
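
A rough sketch of how the low-level input loop might use such a routine, again in Python and again assuming a CSV-like format where a newline inside double quotes does not end the line. The driver function split_lines is hypothetical and stands in for the COPY IN buffering code; lengths here count characters of a Python string, whereas the real thing would count bytes.

```python
def formatter_linelength(buffer):
    """Return the length of the first complete line in buffer, or -1.

    The returned length includes the terminating newline. A newline
    inside double quotes is treated as data, not as a line end.
    """
    in_quotes = False
    for i, ch in enumerate(buffer):
        if ch == '"':
            # A doubled quote toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            return i + 1
    return -1  # incomplete: caller should read more data and retry

def split_lines(chunks):
    """Hypothetical driver: append incoming chunks, hand off complete lines."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            n = formatter_linelength(buffer)
            if n < 0:
                break
            yield buffer[:n]      # this is what formatter_in would receive
            buffer = buffer[n:]
```

Feeding it data that arrives split in the middle of a quoted field shows the -1 case: nothing is handed off until the closing quote and the unquoted newline have both arrived.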

It's a tad annoying that we need two functions for input. One way that
we could still keep the COPY option syntax to be just
	FORMAT csv
is to create an arbitrary difference in the signatures of the input
functions. Then we could have coexisting functions
	csv(text[]) returns text
	csv(text) returns text[]
	csv(text, ...) returns int
that are referenced by "FORMAT csv".

regards, tom lane