Quick Links

CopyReadLineText optimization revisited

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	CopyReadLineText optimization revisited
Date:	2010-08-26 17:47:39
Message-ID:	4C76A8BB.3060502@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I'm reviving the effort I started a while back to make COPY faster:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php

The patch I now have is based on using memchr() to search end-of-line.
In a nutshell:

* we perform possible encoding conversion early, one input block at a
time, rather than after splitting the input into lines. This allows us
to assume in the later stages that the data is in server encoding,
allowing us to search for the '\n' byte without worrying about
multi-byte characters.

* instead of the byte-at-a-time loop in CopyReadLineText(), use memchr()
to find the next NL/CR character. This is where the speedup comes from.
Unfortunately we can't do that in the CSV codepath, because newlines can
be embedded in quoted, so that's unchanged.

These changes seem to give an overall speedup of between 0-10%,
depending on the shape of the table. I tested various tables from the
TPC-H schema, and a narrow table consisting of just one short text column.

I can't think of a case where these changes would be a net loss in
performance, and it didn't perform worse on any of the cases I tested
either.

There's a small fly in the ointment: the patch won't recognize backslash
followed by a linefeed as an escaped linefeed. I think we should simply
drop support for that. The docs already say:

> It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa).

I vaguely recall that we discussed this some time ago already and agreed
that we can drop it if it makes life easier.

This patch is in pretty good shape, however it needs to be tested with
different exotic input formats. Also, the loop in CopyReadLineText could
probaby be cleaned up a bit, some of the uglifications that were done
for performance reasons in the old code are no longer necessary, as
memchr() is doing the heavy-lifting and the loop only iterates 1-2 times
per line in typical cases.

It's not strictly necessary, but how about dropping support for the old
COPY protocol, and the EOF marker \. while we're at it? It would allow
us to drop some code, making the remaining code simpler, and reduce the
testing effort. Thoughts on that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
copyreadline-optimization-1.patch	text/x-diff	26.5 KB

Responses

Re: CopyReadLineText optimization revisited at 2010-08-26 19:16:06 from Tom Lane
Re: CopyReadLineText optimization revisited at 2011-02-19 02:35:10 from Bruce Momjian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2010-08-26 18:47:36	Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session
Previous Message	Thomas Lockhart	2010-08-26 17:11:14	Re: Committers info for the git migration - URGENT!