Re: COPY Transform support

From: PFC <lists(at)peufeu(dot)com>
To: NikhilS <nikkhils(at)gmail(dot)com>, "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 14:44:56
Message-ID: op.t81hw6jmcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> Data transformation while doing a data load is a requirement now and
> then.
> Considering that users will have to do mass updates *after* the load
> completes to mend the data to their liking should be reason enough to do
> this while the loading is happening. I think to go about it the right
> way we
> should support the following:

> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expressions while doing the
> transformation
> The transformation expression should be any expression (basically
> ExecEvalExpr) that can be evaluated to give a resulting value and
> obviously
> a corresponding is_null value too. It should and could be system in-built
> functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or user defined
> functions too
> * The transformation expression can refer to other columns involved in
> the
> load. So that when the current row is extracted from the input file, the
> current values should be used to generate the new resultant values before
> doing a heap_form_tuple. E.g.
> (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
> "UPPER(col1 || col3)",...)
> I have spent some thoughts on how to do this and will be happy to share
> the
> same if the list is interested. Personally, I think data transformation
> using such expressions is a pretty powerful and important activity while
> doing the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the
parsing overead, I suggest adding a special SELECT form that can read from
a file instead of a table, which returns tuples, and which therefore can
be used and abused to the user's liking. This is a much more powerful
feature because :

- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE :

Suppose I want to import a MySQL dump file (gasp !) which obviously
contains lots of crap like 0000-00-00 dates, '' instead of NULL, borken
foreign keys, etc.

Let's have a new command :

CREATE FLATFILE READER mydump (
id INTEGER,
date TEXT,
...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc)
;

This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY.
Column definition gives a name and type to the fields in the text file,
and tells the parser what to expect and what to return.
It looks like a table definition, and this is actually pretty normal : it
is, after all, very close to a table.

INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date, '0000-00-00'
), ... FROM mydump WHERE (FKs check and drop the borken records);

Now I can import data and transform it at will using a simple SELECT. The
advantage is that everybody will know what to do without learning a new
command, no awkward syntax (transform...), you can combine columns in
expressions, JOIN to ckeck FKs, use ORDER to get a clustered table,
anything you want, without any extension to the Postgres engine besides
the creation of this file-parsing set-returning function, which should be
pretty simple.

Or, if I have a few gigabytes of logs, but I am absolutely not interested
in inserting them into a table, instead I want to make some statistics, or
perhaps I want to insert into my table some aggregate computation from
this data, I would just :

CREATE FLATFILE READER accesses_dump (
date TEXT,
ip INET,
...
) FROM file 'web_server_logtxt';

And I can do some stats without even loading the data :

SELECT ip, count(*) FROM accesses_dump GROUP BY ip ORDER BY count(*)
HAVING count(*) > 1000;

Much better than having to load those gigabytes just to make a query on
them...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-04-03 14:45:17 Re: Patch queue -> wiki (was varadic patch)
Previous Message Tom Lane 2008-04-03 14:35:54 Re: psql \G command -- send query and output using extended format