Re: COPY Transform support

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 17:23:09
Message-ID: 87sky2g8ea.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
>> Here's a proposal for COPY to support the T part of an ETL, that is adding the
>> capability for COPY FROM to Transform the data it gets.
>
>> The idea is quite simple: adding to COPY FROM the option to run a function on
>> the data before to call datatype_in functions.
>
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.
>
> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything. The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc. What
> would makes sense in my mind is a single function taking and returning
> text, which is invoked once on each complete input line before it is
> broken into fields.

I think not having to deal with separating fields is actually one of the few
reasons to do this within COPY. If you can separate out yourself or need to do
something more clever than COPY is capable of to split the columns then you're
better off preprocessing it with perl or something anyways.

To that end all the other use cases you describe could be handled with his
plan. There's nothing stopping you from doing

CREATE READER foo (a integer, b integer)
INSERT INTO b (SELECT a+b FROM foo);
or
INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo)

However I'm not sure we even need new syntax for CREATE READER. I would think
something like this would make more sense:

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b;

COPY b FROM 'foo' USING transform(integer,integer);

> So the whole thing seems just marginally attractive to me.

Everything about ETL is only marginally attractive, but it's something people
spend a lot of time doing. Nobody's come up with any particularly clean
solutions I think.

AFAIK the state of the art is actually to load the data into a table which
closely matches the source material, sometimes just columns of text. Then copy
it all to another table doing transformations. Not impressed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message D'Arcy J.M. Cain 2008-04-03 17:27:03 Re: modules
Previous Message Andrew Dunstan 2008-04-03 17:16:39 Re: [GENERAL] SHA1 on postgres 8.3