Re: COPY fast parse patch

From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: <neilc(at)samurai(dot)com>
Cc: <agoldshuv(at)greenplum(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: COPY fast parse patch
Date: 2005-06-02 02:12:27
Message-ID: 18637.203.26.206.129.1117678347.squirrel@www.dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Neil Conway said:
> On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
>> 1) The patch includes 2 parallel parsing code paths. One is the
>> regular COPY path that we all know, and the other is the improved one
>> that I wrote. This is only temporary, as there is a lot of code
>> duplication
>
> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is "temporary", are you suggesting
> that you intend to reimplement your changes as
> improvements/replacements of the existing COPY code path rather than as
> a parallel code path?
>

It's not an all or nothing deal. When we put in CSV handling, we introduced
two new routines for attribute input/output and otherwise used the rest of
the COPY code. When I did a fix for the multiline problem, it was originally
done with a separate read line function for CSV mode - Bruce didn't like
that so I merged it back into the existing code. In restrospect, given this
discussion, that might not have been an optimal choice. But the point is
that you can break out at several levels.

Incidentally, there might be a good case for allowing the user to set the
line end explicitly, but you can't just hardwire it - we will get massive
Windows breakage. What is more, in CSV mode line end sequences can occur
within logical lines. You need to take that into account. It's tricky and
easy to get badly wrong.

I will be the first to admit that there are probably some very good
possibilities for optimisation of this code. My impression though has been
that in almost all cases it's fast enough anyway. I know that on some very
modest hardware I have managed to load a 6m row TPC line-items table in just
a few minutes. Before we start getting too hung up, I'd be interested to
know just how much data people want to load and how fast they want it to be.
If people have massive data loads that take hours, days or weeks then it's
obviously worth improving if we can. I'm curious to know what size datasets
people are really handling this way.

cheers

andrew

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Bruce Momjian 2005-06-02 03:54:08 Re: Backslash handling in strings
Previous Message Neil Conway 2005-06-02 01:42:29 Re: COPY fast parse patch