Re: COPY FROM performance improvements

From: Mark Wong <markw(at)osdl(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alon Goldshuv <agoldshuv(at)greenplum(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 22:51:33
Message-ID: 200507192251.j6JMpAjA016147@smtp.osdl.org
Lists: pgsql-patches pgsql-performance

Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that.

Mark

On Tue, 19 Jul 2005 18:17:52 -0400
Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> Mark,
>
> You should definitely not be doing this sort of thing, I believe:
>
> CREATE TABLE orders (
>     o_orderkey INTEGER,
>     o_custkey INTEGER,
>     o_orderstatus CHAR(1),
>     o_totalprice REAL,
>     o_orderDATE DATE,
>     o_orderpriority CHAR(15),
>     o_clerk CHAR(15),
>     o_shippriority INTEGER,
>     o_comment VARCHAR(79),
>     PRIMARY KEY (o_orderkey))
>
> Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want using ALTER TABLE. Last time I did a load like this (albeit two orders of magnitude smaller) I saw a 50% speedup from deferring constraint creation.
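>
> As a rough sketch of that order of operations (the file path and the '|'
> delimiter are my assumptions, based on the usual dbt3 .tbl files):
>
> -- same table as above, but with no PRIMARY KEY clause
> CREATE TABLE orders (
>     o_orderkey INTEGER,
>     o_custkey INTEGER,
>     o_orderstatus CHAR(1),
>     o_totalprice REAL,
>     o_orderDATE DATE,
>     o_orderpriority CHAR(15),
>     o_clerk CHAR(15),
>     o_shippriority INTEGER,
>     o_comment VARCHAR(79));
>
> -- bulk load first...
> COPY orders FROM '/tmp/dbt3/orders.tbl' WITH DELIMITER '|';
>
> -- ...then build the constraint in one pass over the loaded data
> ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);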
>
>
> cheers
>
> andrew
>
>
>
> Mark Wong wrote:
>
> >Hi Alon,
> >
> >Yeah, that helps. I just need to break up my scripts a little so they
> >load the data first and build the indexes afterward.
> >
> >Is the following information good enough to give a guess about the data
> >I'm loading, if you don't mind? ;) Here's a link to my script to create
> >tables:
> >http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
> >
> >File sizes:
> >-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> >-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> >-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> >-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> >-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> >-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> >-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> >-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
> >
> >Number of rows:
> ># wc -l *.tbl
> > 15000000 customer.tbl
> > 600037902 lineitem.tbl
> > 25 nation.tbl
> > 150000000 orders.tbl
> > 20000000 part.tbl
> > 80000000 partsupp.tbl
> > 5 region.tbl
> > 1000000 supplier.tbl
> >
> >Thanks,
> >Mark
> >
> >On Tue, 19 Jul 2005 14:05:56 -0700
> >"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
> >
> >
> >
> >>Hi Mark,
> >>
> >>I improved the data *parsing* capabilities of COPY, and didn't touch the
> >>data conversion or data insertion parts of the code. The parsing speedup
> >>will therefore vary greatly depending on the ratio of parsing to converting
> >>and inserting.
> >>
> >>So the speed increase really depends on the nature of your data.
> >>A 100GB file with:
> >>
> >>- long data rows (lots of parsing)
> >>- a small number of columns (few attribute conversions per row)
> >>- fewer rows (fewer tuple insertions)
> >>
> >>will show the best performance improvement.
> >>
> >>However, a file of the same 100GB size with:
> >>
> >>- short data rows (minimal parsing)
> >>- a large number of columns (many attribute conversions per row)
> >>- and/or more rows (more tuple insertions)
> >>
> >>will show an improvement, but a less significant one.
> >>
> >>In general I'd estimate a 40%-95% improvement in load speed for the first
> >>case and 10%-40% for the second, though that also depends on the hardware,
> >>disk speed, etc. This is for TEXT format. As for CSV, it may be faster too,
> >>but not by as much as I specified here. BINARY will stay the same as before.
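> >>
> >>For reference, the three formats under discussion are selected like this
> >>in COPY (just a sketch; the file paths and the '|' delimiter are my
> >>assumptions based on the usual dbt3 .tbl files):
> >>
> >>COPY lineitem FROM '/tmp/dbt3/lineitem.tbl' WITH DELIMITER '|'; -- TEXT
> >>COPY lineitem FROM '/tmp/dbt3/lineitem.csv' WITH CSV;           -- CSV
> >>COPY lineitem FROM '/tmp/dbt3/lineitem.bin' WITH BINARY;        -- BINARY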
> >>
> >>HTH
> >>Alon.
> >>
> >>On 7/19/05 12:54 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:
> >>
> >>
> >>
> >>>On Thu, 14 Jul 2005 17:22:18 -0700
> >>>"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
> >>>
> >>>
> >>>
> >>>>I revisited my patch and removed the code duplication that was there, and
> >>>>added support for CSV with buffered input, so CSV now runs faster too
> >>>>(although it is not as optimized as the TEXT format parsing). So now
> >>>>TEXT, CSV and BINARY are all parsed in CopyFrom(), as in the original file.
> >>>>
> >>>>
> >>>Hi Alon,
> >>>
> >>>I'm curious, what kind of system are you testing this on? I'm trying to
> >>>load 100GB of data for our dbt3 workload on a 4-way Itanium 2, and I'm
> >>>interested in what results you would expect.
> >>>
> >>>Mark
> >>>
> >>>
> >>>
> >
