Re: copy vs. C function

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: copy vs. C function
Date: 2011-12-12 16:38:04
Message-ID: CAHyXU0yC8Q3JihEws=naH0p9Mh5MyLLu7yXAjp9gE0sQORN-fA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql(at)jamponi(dot)net> wrote:
> I was experimenting with a few different methods of taking a line of
> text, parsing it, into a set of fields, and then getting that info
> into a table.
>
> The first method involved writing a C program to parse a file, parse
> the lines and output newly-formatted lines in a format that
> postgresql's COPY function can use.
> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
> output new data to new file -- 4 seconds, COPY new file -- 10
> seconds).
>
> The next approach I took was to write a C function in postgresql to
> parse a single TEXT datum into an array of C strings, and then use
> BuildTupleFromCStrings. There are 8 columns involved.
> Eliding the time it takes to COPY the (raw) file into a temporary
> table, this method took 120 seconds, give or take.
>
> The difference was /quite/ a surprise to me. What is the probability
> that I am doing something very, very wrong?
>
> NOTE: the code that does the parsing is actually the same,
> line-for-line, the only difference is whether the routine is called by
> a postgresql function or by a C program via main, so obviously the
> overhead is elsewhere.
> NOTE #2: We are talking about approximately 2.6 million lines.

Let me throw out an interesting third method I've been using to parse
delimited text files that might be useful in your case. This is
useful when parsing text that is bad csv where values are not escaped
or there are lines, incomplete and/or missing records, or a huge
amount of columns that you want to rotate into a more normalized
structure based on columns position.

1. Import the data into a single column (containing the entire line)
staging table, feeding the COPY parser a bogus delimiter
2. 'Parse' the record with regexp_split_to_array (maybe in plpgsql function).
3. Either loop the array (in 9.1 use FOR-IN-ARRAY construct), or, if
you can work it into your problem, INSERT/SELECT, expanding the array
with a trick like used in information_schema._pg_expandarray so you
can hook logic on the array (column position).

merlin

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Aleksej Trofimov 2011-12-13 13:55:02 Postgres array parser
Previous Message Tom Lane 2011-12-12 15:26:10 Re: autovacuum, exclude table