Re: COPY FROM performance improvements

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, agoldshuv(at)greenplum(dot)com
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY FROM performance improvements
Date: 2005-06-25 22:48:53
Message-ID: BEE32D65.7D05%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Andrew,

> What I would like to have is a high level description of
> . how the new text mode code differs from the old text mode code, and
> . which part of the change is responsible for how much performance gain.
>
> Maybe I have missed that in previous discussion, but this change is
> sufficiently invasive that I think you owe that to the reviewers.

You're right - this is a nearly full replacement of the code for the text
mode, so some explanation is necessary.

We have users that are focused on Business Intelligence and Data Warehousing
use cases. They routinely load files of sizes upward of 1GB and some load n
x 10GB per load. They normally use a text format because it's the easiest
and fastest to produce and should also be fast to load. In the absence of a
configurable format loader like Oracle's SQL*Loader, the COPY FROM path is
what they use. We'll leave the formatting discussion for later, because
there are definitely big improvements needed to serve this audience, but
there is too much info to cover here before 8.1 freeze, so back to
performance.

Our customers noticed that PostgreSQL's COPY text performance was
substantially slower than Oracle and far slower than the underlying disk
subsystems. Typical performance ranged from 4MB/s to 8MB/s and was
bottlenecked on CPU. Disk subsystems we use are typically capable of n x
100MB/s write rates. This was proving to be a big problem for daily update
loads on warehouses, so we looked into what the slowdown was.

We profiled the copy input path and found that the combination of per
character I/O (fgetc, etc) and the subsequent per character logic was
responsible for a large fraction of the time it took to load the data. The
parsing routine was running at 17 MB/s on the fastest (Opteron) CPUs at very
high compiler optimization, whether taking input from the network or from
file on a backend COPY. Given other logic we've worked with for high
performance I/O subsystems, we knew we could improve this to well over
100MB/s on typical machines, but it would require rewriting the logic to
expose more to the compiler and CPU.

An improvement would require exposing much more long runs of instructions
with efficient access to memory to the compiler/CPU to allow for pipelining
and micro-parallelism to become effective in order to reach higher I/O input
rates. That is why the current patch reads a buffer of input, then scans
for the "special" bytes (line ends, escapes, delimiters), avoiding copies
until they can be efficiently done in bulk operations. The resulting code
now runs at over 100MB/s, which exposes the remainder of the COPY path as
the new bottleneck. Now we see between 20% and 90% performance improvements
in COPY speed, which will make many customers in BI/DW very happy.

I do not believe that significant performance gains can be made without this
invasive change to the copy.c routine because of the need for extensive
buffering and array logic, which is requires a substantially different
organization of the processing.

I hope this helps,

- Luke

In response to

Browse pgsql-patches by date

  From Date Subject
Next Message Alon Goldshuv 2005-06-25 22:52:29 Re: COPY FROM performance improvements
Previous Message Alon Goldshuv 2005-06-25 22:17:08 Re: COPY FROM performance improvements