Re: COPY FROM performance improvements

From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "Mark Wong" <markw(at)osdl(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 22:06:17
Message-ID: BF02C769.72A8%agoldshuv@greenplum.com
Lists: pgsql-patches pgsql-performance

Mark,

Thanks for the info.

Yes, taking indexes out of the picture is a good idea for this purpose.
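
For example (just a sketch; the table definition is abbreviated, and the
path and index name are placeholders):

    -- create the table bare, with no PRIMARY KEY or indexes,
    -- so the timing measures COPY alone
    CREATE TABLE customer (c_custkey integer, c_name varchar(25) /* ... */);
    COPY customer FROM '/path/to/customer.tbl' WITH DELIMITER AS '|';
    -- build indexes only after the timed load
    CREATE INDEX i_c_custkey ON customer (c_custkey);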

I can't really guess how fast the load rate should be. I don't know how
your system is configured or what its hardware characteristics are (and
even if I knew all that, I might not be able to guess...). I am pretty
confident that the load will be faster than before; I'll risk that ;-)
Looking at your TPC-H sizes and metadata, I'd estimate that partsupp,
customer and orders will show the most significant increase in load rate.
You could start with those.

I guess the only way to really know is to try... Load several times with the
existing PG-COPY and then load several times with the patched COPY and
compare. I'll be curious to hear your results.
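
Something like this, for instance (the paths and database name are
placeholders; also note that if the .tbl files still carry dbgen's
trailing '|' at the end of each row, you'd need to strip it first or COPY
will see an extra column):

    # timed load with the stock postmaster
    time psql tpch -c "COPY partsupp FROM '/data/partsupp.tbl' WITH DELIMITER AS '|'"

    # truncate, restart with the patched postmaster, then repeat the timed load
    psql tpch -c "TRUNCATE partsupp"
    time psql tpch -c "COPY partsupp FROM '/data/partsupp.tbl' WITH DELIMITER AS '|'"

Running each variant a few times and taking the median should keep
checkpoints and caching from skewing a single run.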

Thx,
Alon.

On 7/19/05 2:37 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:

> Hi Alon,
>
> Yeah, that helps. I just need to break up my scripts a little to load
> the data without building indexes.
>
> Is the following information good enough to give a guess about the data
> I'm loading, if you don't mind? ;) Here's a link to my script to create
> tables:
> http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
> File sizes:
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>
> Number of rows:
> # wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
>
> Thanks,
> Mark
>
> On Tue, 19 Jul 2005 14:05:56 -0700
> "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
>
>> Hi Mark,
>>
>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>> data conversion or data insertion parts of the code. The parsing
>> improvement will vary greatly depending on the ratio of parsing to
>> converting and inserting.
>>
>> Therefore, the speed increase really depends on the nature of your data:
>>
>> A 100GB file with
>>   long data rows (lots of parsing)
>>   a small number of columns (few attr conversions per row)
>>   fewer rows (fewer tuple insertions)
>>
>> will show the best performance improvements.
>>
>> However, the same 100GB file with
>>   short data rows (minimal parsing)
>>   a large number of columns (many attr conversions per row)
>>   AND/OR
>>   more rows (more tuple insertions)
>>
>> will show improvements, but not as significant.
>> In general I'd estimate a 40%-95% improvement in load speed for the 1st
>> case and 10%-40% for the 2nd, though that also depends on the hardware,
>> disk speed, etc... This is for the TEXT format. CSV may be faster too,
>> but not by as much as specified here. BINARY will stay the same as before.
>>
>> HTH
>> Alon.
>>
>> On 7/19/05 12:54 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:
>>
>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>> "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
>>>
>>>> I revisited my patch and removed the code duplications that were there, and
>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>> TEXT, CSV and BINARY are all parsed in CopyFrom(), like in the original
>>>> file.
>>>
>>> Hi Alon,
>>>
>>> I'm curious, what kind of system are you testing this on? I'm trying to
>>> load 100GB of data in our dbt3 workload on a 4-way Itanium 2. I'm
>>> interested in the results you would expect.
>>>
>>> Mark
>>>
>>
>
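
P.S. For completeness, the three formats discussed above are selected
roughly like this (file paths are made up, and the CSV/BINARY inputs would
of course have to actually be in those formats):

    COPY lineitem FROM '/data/lineitem.tbl' WITH DELIMITER AS '|'; -- TEXT, biggest gain
    COPY lineitem FROM '/data/lineitem.csv' WITH CSV;              -- CSV, smaller gain
    COPY lineitem FROM '/data/lineitem.bin' WITH BINARY;           -- BINARY, unchanged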
