Re: COPY FROM performance improvements

From: Mark Wong <markw(at)osdl(dot)org>
To: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 21:37:58
Message-ID: 200507192137.j6JLbZjA011011@smtp.osdl.org
Lists: pgsql-patches pgsql-performance

Hi Alon,

Yeah, that helps. I just need to break up my scripts a little so they
load the data without building indexes.
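
Something like this should do it (a rough sketch; the database name and
data directory are illustrative, and it assumes the trailing '|' that
dbgen emits has already been stripped from the .tbl files):

    #!/bin/sh
    # Load-only phase: COPY each flat file in, deferring index builds
    # to a separate script run after the load completes.
    DBNAME=dbt3
    DATADIR=/mnt/dbt3data

    for t in nation region part supplier partsupp customer orders lineitem; do
        psql -d "$DBNAME" -c "COPY $t FROM '$DATADIR/$t.tbl' WITH DELIMITER '|';"
    done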

Is the following information enough for a guess about the data I'm
loading, if you don't mind? ;) Here's a link to my script that creates
the tables:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
15000000 customer.tbl
600037902 lineitem.tbl
25 nation.tbl
150000000 orders.tbl
20000000 part.tbl
80000000 partsupp.tbl
5 region.tbl
1000000 supplier.tbl
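
Since the gain depends on row width, average bytes per row falls
straight out of those two listings; here's a quick sketch to compute it
(assumes the .tbl files are in the current directory):

    for f in *.tbl; do
        wc -lc "$f" | awk '{ printf "%-14s ~%.0f bytes/row\n", $3, $2/$1 }'
    done

By that math, lineitem works out to roughly 130 bytes per row.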

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY and didn't touch the
> data conversion or data insertion parts of the code. The improvement
> will therefore vary widely depending on the ratio of parsing to
> converting and inserting.
>
> Therefore, the speed increase really depends on the nature of your data.
>
> A 100GB file with:
> - long data rows (lots of parsing)
> - a small number of columns (few attribute conversions per row)
> - fewer rows (fewer tuple insertions)
>
> will show the best performance improvement.
>
> However, the same 100GB with:
> - short data rows (minimal parsing)
> - a large number of columns (many attribute conversions per row)
> - and/or more rows (more tuple insertions)
>
> will show an improvement, but a less significant one.
>
> In general I'd estimate a 40%-95% improvement in load speed for the 1st
> case and 10%-40% for the 2nd, though that also depends on the hardware,
> disk speed, etc. This is for the TEXT format; CSV should be faster too,
> but not by as much as the figures above. BINARY will stay the same as
> before.
>
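> If you want to see where the time actually goes on your own data, a
> quick check is something like this (illustrative only; the table name
> and path are placeholders):
>
>     psql -d dbt3 <<'EOF'
>     \timing
>     COPY lineitem FROM '/mnt/dbt3data/lineitem.tbl' WITH DELIMITER '|';
>     EOF
>
> Running the same COPY with and without the patch, into a table with no
> indexes, isolates the parsing/conversion/insertion path.
>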
> HTH
> Alon.
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
> >
> >> I revisited my patch, removed the code duplication that was there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV, and BINARY are all parsed in CopyFrom(), as in the original file.
> >
> > Hi Alon,
> >
> > I'm curious: what kind of system are you testing this on? I'm trying to
> > load 100GB of data for our dbt3 workload on a 4-way Itanium 2, and I'm
> > interested in what results you would expect.
> >
> > Mark
> >
>
