Skip site navigation (1) Skip section navigation (2)

Re: 8.3 / 8.2.6 restore comparison

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>,"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>,"Luke Lonergan" <llonergan(at)greenplum(dot)com>,"Greg Smith" <gsmith(at)gregsmith(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 8.3 / 8.2.6 restore comparison
Date: 2008-02-24 09:39:42
Message-ID: 47C13B5E.3020604@enterprisedb.com (view raw or flat)
Thread:
Lists: pgsql-hackers
Joshua D. Drake wrote:
> On Sun, 24 Feb 2008 00:43:18 +0000
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:
>  
>> Incidentally, I've been working on a patch to speed up CopyReadLine.
>> I was going to run some more tests first, but since we're talking
>> about it, I guess I should just post the patch. I'll post to
>> pgsql-patches shortly.
> 
> On your post to patches you mentioned only about a 5% improvement.
> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I
> can tell we are about 300% behind the curve.

Yeah. Looking at the profile, the time is spent really all over the 
place. There's no one clear bottleneck to focus on. I think we could do 
a few more ~5% improvements, but

At some point, I think we have to bite the bullet and find a way to use 
multiple CPUs for a single load. I don't have any good ideas or plans 
for that, but hopefully someone does.

> My tests were maxing out at ~22G an hour. On hardware that can do 
> in 360G an hour and that is assuming > 50% overhead between OS, libs,
> etc... I have no choice but to conclude we have a much, much deeper and
> fundamental issue going on with COPY. I am inspired by Itagaki Takahiro
> and his batch update of indexes which should help greatly overall but
> doesn't help my specific issue.

Yep, the index build idea is an I/O improvement, not a CPU one.

> Forgive me for not being a C programmer and Alvaro is not online so I
> would vet these questions with him first.
> 
> I know that copy is in theory a bulk loader but, when performing the
> readline how many lines are we reading?  Do we read up to 8192? Or do we
> shove in say 8megs of data before we invoke DoCopy?

We read 64 KB at a time, and then CopyReadLineText returns one line at a 
time from that buffer.

Looking at your profile more, and after the memchr patch, the "raw input 
side" of copy, consisting of reading the data from disk in 64KB blocks, 
splitting that into lines, and splitting lines into columns, still takes 
~20% of the CPU time. I suspect CopyReadAttributesText is the biggest 
culprit there.

You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by 
creating the table (or truncating it) in the same transaction with the COPY.

After that, heap_formtuple is high on the list. I wonder if we could do 
something about that.

> I am just curious if there is some simple low hanging fruit that is
> possibly missing.

I don't see any piece of code that's causing problems. We can shave off 
a few percents here and there I think, but don't expect a 300% 
improvement anytime soon. A few ideas I've thought about are:

- use a specialized version of strtol, for base 10. That won't help on 
your table, but I've seen strtol consume a significant amount of time on 
tables with numeric/integer columns.

- Instead of pallocing and memcpying the text fields, leave a little bit 
of room between fields in the attribute_buf, and write the varlen header 
there directly. This might help you since your table has a lot of text 
fields.

- Instead of the normal PG function calling conventions, provide 
specialized fastpath input functions for the most common data types. 
InputFunctionCall consumed 4.5% of the CPU time in your profile.

- Use a simpler memory context implementation, that's like a stack with 
no pfree support, for the per-tuple context.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

In response to

Responses

pgsql-hackers by date

Next:From: Heikki LinnakangasDate: 2008-02-24 09:44:39
Subject: Re: 8.3 / 8.2.6 restore comparison
Previous:From: Simon RiggsDate: 2008-02-24 08:14:24
Subject: Re: Batch update of indexes on data loading

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group