Skip site navigation (1) Skip section navigation (2)

Optimizing COPY

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Optimizing COPY
Date: 2008-10-30 13:14:14
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-hackers
Back in March, I played around with various hacks to improve COPY FROM 

I got busy with other stuff, but I now got around to try what I planned 
back then. I don't know if I have the time to finish this for 8.4, but 
might as well post what I've got.

The basic idea is to replace the custom loop in CopyReadLineText with 
memchr(), because memchr() is very fast. To make that possible, perform 
the client-server encoding conversion on each raw block that we read in, 
before splitting it into lines. That way CopyReadLineText only needs to 
deal with server encodings, which is required for the memchr() to be safe.

Attached is a quick patch for that. Think of it as a prototype; I 
haven't tested it much, and I feel that it needs some further cleanup. 
Quick testing suggests that it gives 0-20% speedup, depending on the 
data. Very narrow tables don't benefit much, but the wider the table, 
the bigger the gain. I haven't found a case where it performs worse.

I'm only using memchr() with non-csv format at the moment. It could be 
used for CSV as well, but it's more complicated because in CSV mode we 
need to keep track of the escapes.

Some collateral damage:
\. no longer works. If we only accept \. on a new line, like we already 
do in CSV mode, it shouldn't be hard or expensive to make it work again. 
The manual already suggests that we only accept it on a single line: 
"End of data can be represented by a single line containing just 
backslash-period (\.)."

Escaping a linefeed or carriage return by prepending it with a backslash 
no longer works. You have to use \n and \r. The manual already warns 
against doing that, so I think we could easily drop support for it. If 
we wanted, it wouldn't be very hard to make it work, though: after 
hitting a newline, scan back and count how many backslashes there is 
before the newline. An odd number means that it's an escaped newline, 
even number (including 0) means it's a real newline.

For the sake of simplifying the code, would anyone care if we removed 
support for COPY with protocol version 2?

   Heikki Linnakangas

Attachment: copy-memchr-1.patch
Description: text/x-diff (16.6 KB)


pgsql-hackers by date

Next:From: Alvaro HerreraDate: 2008-10-30 13:18:06
Subject: Re: Block-level CRC checks
Previous:From: Simon RiggsDate: 2008-10-30 13:04:02
Subject: Re: Hot Standby: Caches and Locks

Privacy Policy | About PostgreSQL
Copyright © 1996-2018 The PostgreSQL Global Development Group