From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Ragged CSV import
Date: 2009-09-10 03:52:28
Message-ID: 14958.1252554748@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The thought that comes to mind for me is something "in front of" copy,
>> that is, give it the text of each line and let it do a text-to-text
>> transformation before COPY chews on it.

> That seems to me to be a whole lot less useful. As I see it, the
> whole point of any enhancement in this area is to reuse the parsing
> code. If I have a CSV file (or some other format COPY understands),
> I don't want to have to write my own parser for that format in order
> to do some simple data transformation (like dropping columns >n).

I'm unconvinced. A significant fraction of the COPY customization
requests that I can remember had to do with cleaning up input that
didn't entirely match COPY's idea of what valid CSV is. If it's close
enough that COPY will successfully parse it as N columns, you can stick
it into an N-column text table and do your transformations afterwards.
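To make that concrete, here is a sketch of the usual staging-table workaround (table and column names are invented for illustration): load everything as text, then cast, trim, and drop columns on the way into the real target.

```sql
-- Hypothetical example: parse the CSV into an all-text staging table,
-- then do the cleanup with ordinary SQL afterwards.
CREATE TEMP TABLE staging (c1 text, c2 text, c3 text);

COPY staging FROM '/tmp/input.csv' WITH CSV;

-- Drop the unwanted third column and convert types during the transfer.
INSERT INTO target (id, name)
SELECT c1::integer, trim(c2)
FROM staging;
```

This relies on COPY being able to parse the file as exactly N columns; anything messier than that has to be fixed before COPY ever sees it.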

Thinking a bit more generally, it seems to me that as COPY IN is
currently constituted, there are really four distinguishable bits
of functionality in a pipeline:

1. Reading from FE/BE protocol (or file) and conversion to database
character set.

2. Parsing the text into a set of de-escaped field values.

3. Converting the field values to internal Datum form (requires
knowledge of a set of target data types).

4. Bulk insertion into the target table.

It is not hard to come up with problems that are most easily solved
by injecting some sort of filtering between any pair of these steps.
You're suggesting a filter between 2 and 3, I suggested one between
1 and 2, and the well-known trick of inserting into a temp table
and then filtering to a final table amounts to a filter between
3 and 4.

We've had proposals before to come up with a form of COPY that can be
used as a data source in an INSERT/SELECT context. That would
automatically provide a way to filter between steps 3 and 4 --- or maybe
better, insist that the COPY-thingy produces just text fields, and leave
both steps 3 and 4 to be done by the INSERT/SELECT. With that, the only
missing functionality would be a textual filter between steps 1 and 2.
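Purely as illustration (this syntax is invented; no such feature exists), a COPY usable as a row source might look something like:

```sql
-- Hypothetical syntax: COPY as a query data source yielding text columns,
-- with type conversion (step 3) and insertion (step 4) left to the
-- surrounding INSERT/SELECT.
INSERT INTO target (id, name)
SELECT raw.c1::integer, upper(raw.c2)
FROM (COPY FROM '/tmp/input.csv' CSV) AS raw(c1 text, c2 text);
```

The point is only the shape of the pipeline: steps 1 and 2 happen inside the COPY-thingy, and everything downstream is ordinary SQL.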

regards, tom lane
