COPY performance

From: "Nigel J(dot) Andrews" <nandrews(at)investsystems(dot)co(dot)uk>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: COPY performance
Date: 2002-04-13 18:13:06
Message-ID: Pine.LNX.4.21.0204131840360.3278-100000@ponder.fairway2k.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-general

I'm trying to COPY in a table of 1 million rows. The table is created by:

create table chat_post_new (
session_id INT NOT NULL references chat_session (session_id),
poster_name VARCHAR(32) NOT NULL,
time TIMESTAMP NOT NULL,
post_number INT NOT NULL,
FTS txtidx
);

The old definition had the integer columns as int2s and the FTS column wasn't
there. Because I wanted to reorder the rows, add them in a more natural order
(by time), I created the data file read in by the copy command using

SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
FROM chat_post
ORDER BY time

After removing the first and last couple of lines, so that only the data is in
the file, renaming the original table and creating the new version I tried
running:

COPY chat_post FROM 'file-path'

The data file is 40MB and has somewhere over 1000000 rows of data. During the
copy I can sit and watch the memory foot print of the backend process
growing. It eventually gets to something like 80MB, with 10-20MB less for it's
resident set size. I've not seen this copy in complete because it can't get
sufficient resources from the system after this point.

So splitting the file into 300000 row chunks and doing four COPY commands:

Initially when a COPY kicks off there is fair amount of disk activity, which is
to be expected, but then the memory foot print starts growing (from a fresh
connection size), disk activity drops off to unnoticeable unless I watch and
listen and the CPU usage increases. The memory usage stablises around 30MB with
24MB resident before the CPU utilisation finishes creeping up. Eventually, but
not exactly long after starting, the backend process is clocking up 99% of the
CPU.

So why does the memory usage increase so? Why does is the process only IO bound
initially? And why is the process so CPU intensive?

I'm thinking that the data file is read into memory entirely before any
processing of the data is done but can't work out why the file isn't just
mmap()ed, STDIN can be special cased surely, and how about all the CPU power
required for the ascii to internal representations being so high in a process
that I would have thought to be very nearly IO bound?

I could take a look at the code, not something I've done yet I must admit, and
see if I can contribute by doing the mmap business if it's seen as a good idea.

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Nigel J. Andrews 2002-04-13 18:19:21 Re: COPY performance
Previous Message Tom Lane 2002-04-13 16:28:22 Re: A *short* planner question

Browse pgsql-general by date

  From Date Subject
Next Message Nigel J. Andrews 2002-04-13 18:19:21 Re: COPY performance
Previous Message google 2002-04-13 17:58:41 Triggers not portable across different pg versions