Re: COPY Transform support

From: NikhilS <nikkhils(at)gmail(dot)com>
To: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 13:59:08
Message-ID: d3c4af540804030659h70e1ed5aw57305b0e69e056e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
wrote:

> Here's a proposal for COPY to support the T part of an ETL, that is adding
> the
> capability for COPY FROM to Transform the data it gets.
>
> The idea is quite simple: adding to COPY FROM the option to run a function
> on
> the data before to call datatype_in functions. This needs some syntax
> addition to be worked out at the COPY side, then the COPY code will have
> to
> run the given function on the read data and consider giving the output of
> it
> to current COPY code (datatype input function).
>
> The function could either get the data as text or bytea, and would have to
> return either text or bytea. bytea seems the more sensible choice, as long
> as
> we don't lose encoding information there, which I'm not sure about.
>
> The syntax could be something like:
> COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;
>
> I tried to only add keywords already present in [1], while getting
> something
> meaningfull... and x is intended to be the column number, counting from 1.
> [1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
>
> Comments?
> --
> dim
>

+1

Data transformation while doing a data load is a requirement now and then.
Considering that users will have to do mass updates *after* the load
completes to mend the data to their liking should be reason enough to do
this while the loading is happening. I think to go about it the right way we
should support the following:

* The ability to provide per-column transformation expressions

COPY mytable (col1 transform to "col1 + 10", col2 transform to "'Post' ||
'greSQL'", col3...) FROM ..

* The ability to use any kind of expressions while doing the transformation
The transformation expression should be any expression (basically
ExecEvalExpr) that can be evaluated to give a resulting value and obviously
a corresponding is_null value too. It should and could be system in-built
functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or user defined
functions too

* The transformation expression can refer to other columns involved in the
load. So that when the current row is extracted from the input file, the
current values should be used to generate the new resultant values before
doing a heap_form_tuple. E.g.

(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)",...)

I have spent some thoughts on how to do this and will be happy to share the
same if the list is interested. Personally, I think data transformation
using such expressions is a pretty powerful and important activity while
doing the data load itself.

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aidan Van Dyk 2008-04-03 14:01:38 Re: modules
Previous Message Greg Sabino Mullane 2008-04-03 13:54:11 Re: modules