
Re: COPY Transform support

From: PFC <lists(at)peufeu(dot)com>
To: NikhilS <nikkhils(at)gmail(dot)com>, "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 14:44:56
Message-ID: op.t81hw6jmcigqcu@apollo13.peufeu.com
Lists: pgsql-hackers
> Data transformation while doing a data load is a requirement now and then.
> Considering that users otherwise have to do mass updates *after* the load
> completes to mend the data to their liking, that should be reason enough
> to do it while the loading is happening. I think to go about it the right
> way we should support the following:
>
> * The ability to provide per-column transformation expressions.
> * The ability to use any kind of expression while doing the transformation.
>   The transformation expression should be any expression (basically
>   ExecEvalExpr) that can be evaluated to give a resulting value and,
>   obviously, a corresponding is_null value too. It could be a built-in
>   function (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or a user-defined
>   function.
> * The transformation expression can refer to other columns involved in
>   the load, so that when the current row is extracted from the input
>   file, the current values are used to generate the new resultant values
>   before doing a heap_form_tuple. E.g.:
>   (col1 transform "col1 + 10", col2 transform "col1 * col2",
>    col3 transform "UPPER(col1 || col3)", ...)
>
> I have spent some thought on how to do this and will be happy to share
> it if the list is interested. Personally, I think data transformation
> using such expressions is a pretty powerful and important activity while
> doing the data load itself.

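	For concreteness, the proposed per-column transforms might look like
this spelled out on a full COPY command; a purely hypothetical sketch (the
TRANSFORM keyword and its placement are assumptions, nothing like it is
implemented):

COPY mytable (col1 TRANSFORM 'col1 + 10',
              col2 TRANSFORM 'col1 * col2',
              col3 TRANSFORM 'UPPER(col1 || col3)')
FROM '/path/to/dump.txt' WITH DELIMITER E'\t';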

	Well, since COPY is about as fast as INSERT INTO ... SELECT plus the
parsing overhead, I suggest adding a special SELECT form that can read from
a file instead of a table, returns tuples, and can therefore be used and
abused to the user's liking. This is a much more powerful feature because:

	- there is almost no new syntax
	- it is much simpler for the user
	- lots of existing stuff can be leveraged

	EXAMPLE:

	Suppose I want to import a MySQL dump file (gasp!) which obviously
contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken
foreign keys, etc.

	Let's have a new command:

CREATE FLATFILE READER mydump (
	id	INTEGER,
	date	TEXT,
	...
) FROM file 'dump.txt'
(followed by delimiter-specification syntax identical to COPY's, etc.);

	This command would create a set-returning function which is basically
a wrapper around COPY's existing parser.
	The column definitions give a name and a type to the fields in the
text file, and tell the parser what to expect and what to return.
	It looks like a table definition, and that is quite natural: it is,
after all, very close to a table.
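
	Once created, the reader would behave like a table in any query, so a
quick sanity check of the parse (hypothetical, since the command is only a
proposal) is just:

SELECT * FROM mydump LIMIT 10;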

	INSERT INTO mytable (id, date, ...)
	SELECT id, NULLIF(date, '0000-00-00'), ...
	FROM mydump
	WHERE (the FK checks pass; broken records are dropped);
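
	Spelled out, that FK filter is just an EXISTS test. A minimal sketch,
assuming a hypothetical parent table customers(id) referenced by a
customer_id field in the dump:

INSERT INTO mytable (id, date, customer_id)
SELECT d.id,
       NULLIF(d.date, '0000-00-00')::date,   -- MySQL zero-dates become NULL
       d.customer_id
FROM mydump d
WHERE EXISTS (SELECT 1 FROM customers c      -- keep only rows whose FK resolves
              WHERE c.id = d.customer_id);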

	Now I can import data and transform it at will using a simple SELECT.
The advantage is that everybody will know what to do without learning a new
command: there is no awkward syntax (transform ...), you can combine columns
in expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine besides the
creation of this file-parsing set-returning function, which should be
pretty simple.

	Or suppose I have a few gigabytes of logs, but I am not interested in
inserting them into a table at all; instead I want to compute some
statistics, or perhaps insert into my table some aggregate computed from
this data. I would just:

CREATE FLATFILE READER accesses_dump (
	date	TEXT,
	ip	INET,
	...
) FROM file 'web_server_log.txt';

	And I can do some stats without even loading the data:

	SELECT ip, count(*) FROM accesses_dump
	GROUP BY ip
	HAVING count(*) > 1000
	ORDER BY count(*);

	Much better than having to load those gigabytes just to make a query on  
them...
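
	And the aggregate-load case mentioned above composes the same way; a
sketch, assuming a hypothetical summary table ip_hits(ip inet, hits bigint):

INSERT INTO ip_hits (ip, hits)
SELECT ip, count(*)        -- aggregate straight from the flat file
FROM accesses_dump
GROUP BY ip;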
