Re: 8.3 / 8.2.6 restore comparison

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 8.3 / 8.2.6 restore comparison
Date: 2008-02-24 17:40:11
Message-ID: 47C1ABFB.1050006@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Joshua D. Drake wrote:
>> On Sun, 24 Feb 2008 00:43:18 +0000
>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:

>> I know that copy is in theory a bulk loader but, when performing the
>> readline how many lines are we reading? Do we read up to 8192? Or do we
>> shove in say 8megs of data before we invoke DoCopy?
>
> We read 64 KB at a time, and then CopyReadLineText returns one line at a
> time from that buffer.

O.k. I am sure I am oversimplifying things but why are we returning one
line at a time? That reads expensive to me. Just following the general,
don't do inserts one at a time, do them in batch idea for example.

I would also question the 64KB at a time. Why not a 1024KB (arbitrary)
at a time? Is it a resource issue? In the old days when we actually had
people trying to run postgresql on 128 and 256 megs of ram, o.k. but now?

> Looking at your profile more, and after the memchr patch, the "raw input
> side" of copy, consisting of reading the data from disk in 64KB blocks,
> splitting that into lines, and splitting lines into columns, still takes
> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest
> culprit there.

In reading:

http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html

It looks like CopyReadAttributesText is used as part of the column
breakup. It also appears that this happens "before" insert right? So if
that is the case we are going to pay an additional penalty on the data
checking.

>
> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by
> creating the table (or truncating it) in the same transaction with the
> COPY.

In the same transaction? Oh that's interesting. So that might be a TODO
right there. Change pg_dump so it does:

create,copy,create,copy,index

Instead of:

create,create,copy,copy,index

>
> After that, heap_formtuple is high on the list. I wonder if we could do
> something about that.

Just from the name I am going to guess this is where we turn it into
something that is actually injected into PostgreSQL.

> I don't see any piece of code that's causing problems. We can shave off
> a few percents here and there I think, but don't expect a 300%
> improvement anytime soon. A few ideas I've thought about are:

Well don't get me wrong, I am not expecting miracles here. I am just
confounded at the complete lack of performance in this arena. I don't
think a lot of people recognize what a significant issue this is since
we don't have in place backups.

>
> - use a specialized version of strtol, for base 10. That won't help on
> your table, but I've seen strtol consume a significant amount of time on
> tables with numeric/integer columns.
>
> - Instead of pallocing and memcpying the text fields, leave a little bit
> of room between fields in the attribute_buf, and write the varlen header
> there directly. This might help you since your table has a lot of text
> fields.
>
> - Instead of the normal PG function calling conventions, provide
> specialized fastpath input functions for the most common data types.
> InputFunctionCall consumed 4.5% of the CPU time in your profile.
>
> - Use a simpler memory context implementation, that's like a stack with
> no pfree support, for the per-tuple context.
>

By my calculations you are presenting a possibility of at least ~ 30%
improvement. That is significant in my book. Hopefully as we explore
these options we will find others.

Sincerely,

Joshua D. Drake

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua D. Drake 2008-02-24 17:47:15 Re: 8.3 / 8.2.6 restore comparison
Previous Message Florian G. Pflug 2008-02-24 17:38:19 Re: Behaviour of rows containg not-null domains in plpgsql