Optimizing COPY

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Optimizing COPY
Date: 2008-10-30 13:14:14
Message-ID: 4909B326.507@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Back in March, I played around with various hacks to improve COPY FROM
performance:
http://archives.postgresql.org/pgsql-patches/2008-03/msg00145.php

I got busy with other stuff, but I now got around to try what I planned
back then. I don't know if I have the time to finish this for 8.4, but
might as well post what I've got.

The basic idea is to replace the custom loop in CopyReadLineText with
memchr(), because memchr() is very fast. To make that possible, perform
the client-server encoding conversion on each raw block that we read in,
before splitting it into lines. That way CopyReadLineText only needs to
deal with server encodings, which is required for the memchr() to be safe.

Attached is a quick patch for that. Think of it as a prototype; I
haven't tested it much, and I feel that it needs some further cleanup.
Quick testing suggests that it gives 0-20% speedup, depending on the
data. Very narrow tables don't benefit much, but the wider the table,
the bigger the gain. I haven't found a case where it performs worse.

I'm only using memchr() with non-csv format at the moment. It could be
used for CSV as well, but it's more complicated because in CSV mode we
need to keep track of the escapes.

Some collateral damage:
\. no longer works. If we only accept \. on a new line, like we already
do in CSV mode, it shouldn't be hard or expensive to make it work again.
The manual already suggests that we only accept it on a single line:
"End of data can be represented by a single line containing just
backslash-period (\.)."

Escaping a linefeed or carriage return by prepending it with a backslash
no longer works. You have to use \n and \r. The manual already warns
against doing that, so I think we could easily drop support for it. If
we wanted, it wouldn't be very hard to make it work, though: after
hitting a newline, scan back and count how many backslashes there is
before the newline. An odd number means that it's an escaped newline,
even number (including 0) means it's a real newline.

For the sake of simplifying the code, would anyone care if we removed
support for COPY with protocol version 2?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
copy-memchr-1.patch text/x-diff 16.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2008-10-30 13:18:06 Re: Block-level CRC checks
Previous Message Simon Riggs 2008-10-30 13:04:02 Re: Hot Standby: Caches and Locks