Importing Large Amounts of Data

From: Curt Sampson <cjs(at)cynic(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Importing Large Amounts of Data
Date: 2002-04-15 06:48:50
Message-ID: Pine.NEB.4.43.0204151445170.439-100000@angelic.cynic.net
Lists: pgsql-hackers


Appended is a message that I sent to the pgsql-general list, and
for which I received no useful reply. (Well, anyway, no reply that
has helped me to speed up my imports.) If you've read it already,
feel free to ignore it, but if you haven't, I'd appreciate any
advice on how to make this work as fast as possible.

There are some things that could be done to help optimize situations
like this, though I don't think any can be done in the short term.
Here are some of my thoughts, which may or may not be useful:

1. Postgres appears to have a fairly high row overhead (40 bytes
or so according to the FAQ), which grieves me slightly, as that's
actually larger than the size of the data in my tuples. It would
seem that in my case some of the items in that header (the OID and
the NULL bitfield) are not used; would it be possible to avoid
allocating these in relations that don't use them?

Also, is there a good description of the on-disk tuple format
somewhere? include/access/htup.h seems to require a fair bit of
knowledge about how other parts of the system work to be understood.
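
A rough way to check the per-tuple footprint I'm describing, assuming
the stock 8 kB block size and a fresh VACUUM so the pg_class counters
are current, is something like:

VACUUM ANALYZE bigone;
SELECT relname,
       relpages,
       reltuples,
       (relpages * 8192) / reltuples AS approx_bytes_per_tuple
FROM pg_class
WHERE relname = 'bigone';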

2. Does it make sense to be able to do some operations without
logging? For the COPY, if the system crashes and I lose some or
all of the tuples I'd imported so far, I don't care that much; I
can just restart the COPY at an appropriate point. As mentioned
below, that would save half a gig of disk writes when importing 5M
tuples.
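
To sketch what "restart at an appropriate point" could look like in
practice (the file names here are just placeholders), the input can be
pre-split into chunks and loaded one COPY per chunk; each COPY commits
on its own, so after a crash the load resumes at the first chunk that
hasn't been loaded yet:

COPY bigone FROM '/data/bigone.chunk001';
COPY bigone FROM '/data/bigone.chunk002';
-- ...and so on, tracking the last completed chunk outside the database.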

3. How about having a way to take a table off-line to work on it,
and bring it back on-line again when done? This would get rid of
the logging overhead, locking overhead, and that sort of stuff,
and in theory might be able to get you something approaching
disk-speed data imports.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

---------- Forwarded message ----------
Date: Thu, 11 Apr 2002 17:28:13 +0900 (JST)
From: Curt Sampson <cjs(at)cynic(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: [GENERAL] Importing Large Amounts of Data

I've been asked by a client to do some testing of Postgres for what
appears to be OLAP on a fairly large data set (about half a billion
tuples). I'm probably going to want to try partitioning this in various
ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:

CREATE TABLE bigone (
    rec_no  INT PRIMARY KEY,
    day     DATE NOT NULL,
    user_id CHAR(5) NOT NULL,
    value   VARCHAR(20) NOT NULL
) WITHOUT OIDS;
DROP INDEX bigone_pkey;

[COPY is done here....]

CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
CREATE INDEX bigone_day ON bigone (day);
CREATE INDEX bigone_user_id ON bigone (user_id);
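
The elided COPY is an ordinary server-side bulk load, roughly of this
form (the file path is a placeholder, not the real one):

COPY bigone FROM '/path/to/bigone.data';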

Unfortunately, the first problem I've run into is that importing is
rather slow. With all indexes (including the bigone_pkey) dropped,
importing five million tuples into the above table, starting from empty,
takes about 921 seconds. The second 5M tuples takes about 1009 seconds.
If I switch to using the -F option (which turns off fsync), the first
5M takes 714 seconds and the
second 5M takes 742 seconds. At the end, I have about 742 MB of data under
the data/base directory. (This is using a fresh database cluster.)

For comparison, MySQL does the two imports in about 221 and 304 seconds
respectively, and in the end its data takes up about 427 MB.

Part of the problem here may be that Postgres appears to be logging the
COPY operation; I get from 27-33 "recycled transaction log file" messages
for every 5M tuple COPY that I do (at the default 16 MB per WAL segment,
that's roughly 430-530 MB of log written). If there were a way to do an
unlogged copy, that might save close to half a gig of writes to the disk.
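
A couple of settings that do exist in 7.2 seem relevant here (not
something the figures above explore, so treat this as untested):
spacing checkpoints further apart should cut down checkpoint overhead
during a bulk load, since fewer whole-page images get pushed into the
log, though it does not make the COPY itself unlogged. A
postgresql.conf sketch, with purely illustrative values:

# bulk-load oriented settings (defaults in parentheses)
checkpoint_segments = 30   # (3)   allow more WAL between checkpoints
checkpoint_timeout = 900   # (300) seconds between forced checkpoints
wal_buffers = 16           # (8)   WAL buffer pages
fsync = false              # same effect as -F; unsafe if the machine crashes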

The other part of the problem may just be the size of the data;
why does Postgres take up 75% more space (about 78 bytes per tuple,
vs. 45 bytes per tuple) for this table?

As well, index builds seem to take about 20% longer than under MySQL
(using -F), and the resulting indexes seem to be about 10% larger.

Does anybody have any suggestions as to how I can improve performance
here, and reduce disk space requirements? If not, I'll probably have
to suggest to the client that he move to MySQL for this particular
application, unless he needs any of the features that Postgres provides
and MySQL doesn't.
