Re: COPY FROM performance improvements

From: Mark Wong <markw(at)osdl(dot)org>
To: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 21:37:58
Message-ID: 200507192137.j6JLbZjA011011@smtp.osdl.org
Lists: pgsql-patches pgsql-performance

Hi Alon,

Yeah, that helps. I just need to break up my scripts a little so they
load the data without building indexes.
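
Something like this should do it (a rough sketch; the database name and
data directory are illustrative, and it assumes the trailing '|' that
dbgen emits has already been stripped from the .tbl files):

    #!/bin/sh
    # Load-only phase: COPY each flat file in, deferring index builds
    # to a separate script run after the load completes.
    DBNAME=dbt3
    DATADIR=/mnt/dbt3data

    for t in nation region part supplier partsupp customer orders lineitem; do
        psql -d "$DBNAME" -c "COPY $t FROM '$DATADIR/$t.tbl' WITH DELIMITER '|';"
    done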

Is the following information enough for a guess about the data I'm
loading, if you don't mind? ;) Here's a link to my script that creates
the tables:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
15000000 customer.tbl
600037902 lineitem.tbl
25 nation.tbl
150000000 orders.tbl
20000000 part.tbl
80000000 partsupp.tbl
5 region.tbl
1000000 supplier.tbl
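
Since the gain depends on row width, average bytes per row falls
straight out of those two listings; here's a quick sketch to compute it
(assumes the .tbl files are in the current directory):

    for f in *.tbl; do
        wc -lc "$f" | awk '{ printf "%-14s ~%.0f bytes/row\n", $3, $2/$1 }'
    done

By that math, lineitem works out to roughly 130 bytes per row.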

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY and didn't touch the
> data conversion or data insertion parts of the code. The improvement
> will therefore vary widely depending on the ratio of parsing to
> converting and inserting.
>
> Therefore, the speed increase really depends on the nature of your data.
>
> A 100GB file with:
> - long data rows (lots of parsing)
> - a small number of columns (few attribute conversions per row)
> - fewer rows (fewer tuple insertions)
>
> will show the best performance improvement.
>
> However, the same 100GB with:
> - short data rows (minimal parsing)
> - a large number of columns (many attribute conversions per row)
> - and/or more rows (more tuple insertions)
>
> will show an improvement, but a less significant one.
>
> In general I'd estimate a 40%-95% improvement in load speed for the 1st
> case and 10%-40% for the 2nd, though that also depends on the hardware,
> disk speed, etc. This is for the TEXT format; CSV should be faster too,
> but not by as much as the figures above. BINARY will stay the same as
> before.
>
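> If you want to see where the time actually goes on your own data, a
> quick check is something like this (illustrative only; the table name
> and path are placeholders):
>
>     psql -d dbt3 <<'EOF'
>     \timing
>     COPY lineitem FROM '/mnt/dbt3data/lineitem.tbl' WITH DELIMITER '|';
>     EOF
>
> Running the same COPY with and without the patch, into a table with no
> indexes, isolates the parsing/conversion/insertion path.
>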
> HTH
> Alon.
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
> >
> >> I revisited my patch, removed the code duplication that was there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV, and BINARY are all parsed in CopyFrom(), as in the original file.
> >
> > Hi Alon,
> >
> > I'm curious: what kind of system are you testing this on? I'm trying to
> > load 100GB of data for our dbt3 workload on a 4-way Itanium 2, and I'm
> > interested in what results you would expect.
> >
> > Mark
> >
>
