COPY FROM performance improvements

From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: COPY FROM performance improvements
Date: 2005-06-23 00:14:28
Message-ID: BEDF4CF4.5F3C%agoldshuv@greenplum.com
Lists: pgsql-hackers

This is a second iteration of an earlier thread that was left unresolved a few
weeks ago. I have made further modifications to the code to make it compatible
with the current COPY FROM code, so it should be more agreeable this time.

The main premise of the new code is that it improves text data parsing
speed by about 4-5x, resulting in total data import improvements of between
15% and 95% (the higher gains occur with large data rows that have few
columns, which means more time spent parsing and less spent converting to
internal format). This is done by replacing char-at-a-time parsing with
buffered parsing, using fast scan routines, and keeping the amount of
loading/appending into the line and attribute buffers to a minimum.
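To illustrate the general idea of scanning a buffer instead of looking at one
character per loop iteration, here is a minimal sketch. It is not taken from
the patch: FieldSpan and split_line are hypothetical names, the line is
assumed to be fully buffered already, and backslash escape handling (which the
real text format needs) is left out. It only records where each field starts
and how long it is, so nothing is copied while scanning.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a field is described by a pointer into the line
 * buffer and a length, so no bytes are copied while scanning. */
typedef struct FieldSpan
{
    const char *start;
    size_t      len;
} FieldSpan;

/*
 * Hypothetical helper: split one buffered line into fields by jumping from
 * delimiter to delimiter with memchr(). Escape sequences are NOT handled
 * here. Returns the number of fields written to out (at most max_fields).
 */
static size_t
split_line(const char *line, size_t len, char delim,
           FieldSpan *out, size_t max_fields)
{
    const char *cur = line;
    const char *end = line + len;
    size_t      nfields = 0;

    while (nfields < max_fields)
    {
        /* One library call skips over the whole field body. */
        const char *hit = memchr(cur, delim, (size_t) (end - cur));
        const char *stop = hit ? hit : end;

        out[nfields].start = cur;
        out[nfields].len = (size_t) (stop - cur);
        nfields++;

        if (hit == NULL)
            break;              /* last field on the line */
        cur = hit + 1;          /* continue after the delimiter */
    }
    return nfields;
}
```

A caller would pass a line without its line terminator, for example
split_line(line, strlen(line), '\t', spans, 4) for tab-delimited data, and
then convert each span to the column's internal format.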

The new code passes both COPY regression tests (copy, copy2) and doesn't
break any of the others.

It also supports encoding conversions (thanks to Peter and Tatsuo for their
feedback) and the 3 line-end types. That said, COPY with different encodings
was only minimally tested. We are looking into creating new tests and hope to
add them to the postgres regression suite one day if the community wants
them.
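For readers unfamiliar with the 3 line-end types (LF, CR, and CR LF), the
sketch below shows one way a buffered parser could decide which one the input
uses. It is an illustration only, not code from the patch: LineEnd and
detect_line_end are hypothetical names. The case worth noting with buffered
input is a CR that is the last byte in the buffer, where the decision has to
wait for more data.

```c
#include <stddef.h>

/* Hypothetical enum naming the three line-end types plus "not known yet". */
typedef enum
{
    LINE_END_UNKNOWN,
    LINE_END_NL,                /* "\n"   */
    LINE_END_CR,                /* "\r"   */
    LINE_END_CRNL               /* "\r\n" */
} LineEnd;

/*
 * Hypothetical helper: look at the first line terminator in the buffer and
 * report its type. Returns LINE_END_UNKNOWN if no terminator is present, or
 * if the buffer ends right after a CR (the next byte could still be LF).
 */
static LineEnd
detect_line_end(const char *buf, size_t len)
{
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] == '\n')
            return LINE_END_NL;
        if (buf[i] == '\r')
        {
            if (i + 1 == len)
                return LINE_END_UNKNOWN;    /* need more data to decide */
            return (buf[i + 1] == '\n') ? LINE_END_CRNL : LINE_END_CR;
        }
    }
    return LINE_END_UNKNOWN;
}
```

This sketch ignores escaped CR or LF characters inside the data, which a real
parser would have to skip over.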

The new code improves parsing of the delimited data format only. BINARY and
CSV stay the same and are executed separately for now (so there is some code
duplication). In the future I plan to improve the CSV path too, so that it
can be executed without duplicating code.

Support for data that arrives via COPY_OLD_FE is still missing (question:
what are the use cases? When is it used? Please advise).

I'll send out the patch soon. At this stage it is basically there to show
that there is a way to load data faster. Future versions of the patch will be
more complete and elegant.

I would appreciate any comments or advice.

Alon.
