Re: CopyReadLineText optimization

From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Cc: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>, <eng(at)intranet(dot)greenplum(dot)com>
Subject: Re: CopyReadLineText optimization
Date: 2008-02-24 01:46:40
Message-ID: 014F2941B0A1EA47BD61D21526B806E901075565@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers pgsql-patches

Cool! It's been a while since we've done the same kind of thing :-)

- Luke

> -----Original Message-----
> From: pgsql-patches-owner(at)postgresql(dot)org
> [mailto:pgsql-patches-owner(at)postgresql(dot)org] On Behalf Of
> Heikki Linnakangas
> Sent: Saturday, February 23, 2008 5:30 PM
> To: pgsql-patches(at)postgresql(dot)org
> Subject: [PATCHES] CopyReadLineText optimization
>
> The purpose of CopyReadLineText is to scan the input buffer
> and find the next newline, taking into account any escape
> characters. It currently operates in a loop, one byte at a
> time, searching for LF, CR, or a backslash. That's a bit
> slow: I've been running oprofile on COPY, and I've seen
> CopyReadLine take around 10% of the CPU time, and Joshua
> Drake just posted a very similar profile to hackers.
>
> Attached is a patch that modifies CopyReadLineText so that it
> uses memchr to speed up the scan. The nice thing about memchr
> is that we can take advantage of any clever optimizations
> that might be in libc or the compiler.
>
> In the tests I've been running, it roughly halves the time
> spent in CopyReadLine (including the new memchr calls), thus
> reducing the total CPU overhead by ~5%. I'm planning to run
> more tests with data that has backslashes and with tables of
> different widths to see what the worst-case and best-case
> performance looks like. Also, it doesn't work for CSV format
> at the moment; that needs to be fixed.
>
> 5% isn't exactly breathtaking, but it's a start. I tried the
> same trick on CopyReadAttributesText, but unfortunately it
> doesn't seem to help there because you need to "stop" the
> efficient word-at-a-time scan that memchr does (at least with
> glibc, YMMV) whenever there's a column separator, while in
> CopyReadLineText you get to process the whole line in one
> call, assuming there are no backslashes.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
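
For readers without the attachment, the memchr idea described in the quoted message can be sketched roughly as below. This is not the patch itself: find_line_end is a hypothetical helper, it handles only LF line endings (no CR, no CSV), and it assumes a backslash escapes exactly the next byte. It also scans each region twice (once for the newline, once for a backslash), which the real patch may well avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL's CopyReadLineText.
 * Returns the offset of the first unescaped '\n' in buf[0..len),
 * or -1 if the buffer holds no complete line.
 * Assumption: a backslash escapes exactly the following byte.
 */
static long
find_line_end(const char *buf, size_t len)
{
    size_t      pos = 0;

    assert(buf != NULL);

    while (pos < len)
    {
        /* Let libc locate the next newline candidate. */
        const char *nl = memchr(buf + pos, '\n', len - pos);
        size_t      span = nl ? (size_t) (nl - (buf + pos)) : len - pos;

        /* Look for a backslash strictly before that candidate. */
        const char *bs = memchr(buf + pos, '\\', span);

        if (bs == NULL)
            return nl ? (long) (nl - buf) : -1;

        /* Skip the backslash and the byte it escapes, then rescan. */
        pos = (size_t) (bs - buf) + 2;
    }

    return -1;
}
```

With no backslashes in the line, the whole line is covered by two memchr calls, which is the case the quoted message says benefits most. Every backslash forces a restart, so heavily escaped data would be the worst case for a scan of this shape.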
