Re: Importing Large Amounts of Data

From: Curt Sampson <cjs(at)cynic(dot)net>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Importing Large Amounts of Data
Date: 2002-04-15 07:41:19
Message-ID: Pine.NEB.4.43.0204151616020.439-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 15 Apr 2002, Christopher Kings-Lynne wrote:

> > ...the OID and
> > the NULL bitfield) are not used; would it be possible to avoid
> > allocating these in this relations that don't use them?
>
> CREATE TABLE WITHOUT OIDS ...

As you can see from the schema I gave later in my message, that's
exactly what I did. But does this actually avoid allocating the
space in the on-disk tuples? What part of the code deals with this?
It looks to me like the four bytes for the OID are still allocated
in the tuple, but not used.

> This conclusion seems to me to be remarkably shortsighted. Does the initial
> data load into the database occur just once or quite often?

Well, I'm going to be doing the initial load (.5 billion tuples) quite
a few times, in order to test some different partitioning arrangements.
So I'll save quite a lot of time initially if I get a faster import.

But from the looks of it, the production system will be doing daily
imports of fresh data ranging in size from a copule of million rows
to a couple of tens of millions of rows.

> It's a bit hard to say "just turn off all the things that ensure your data
> integrity so it runs a bit faster", if you actually need data integrity.

I'm not looking for "runs a bit faster;" five percent either way
makes little difference to me. I'm looking for a five-fold performance
increase.

> Anyway, from what I understand an OLTP application is all about selects and
> memoising certain aggregate results.

I guess that was a typo, and you meant OLAP?

Anyway, from the looks of it, this is going to be fairly simple
stuff. (Unfortunately, I don't have details of the real application
the client has in mind, though I sure wish I did.) What I'm trying
to indicate when I say "OLAP" is that it's basically selecting
across broad swaths of a large data set, and doing little or nothing
in the way of updates. (Except for the daily batches of data, of
course.)

> The fact that you can load stuff quicker in
> MySQL and it takes up less disk space seems totally irrelevant.

Yeah, everybody's telling me this. Let me try once again here:

1. Every day, I must import millions, possibly tens of
millions, of rows of data. Thus, speed of import is indeed
fairly important to me.

2. It looks, at least at this point, as if the application
will be doing only fairly simple selects out of the current
half-billion rows of data and whatever gets added in the
future. Thus, I don't think that using MySQL would be a
problem. (If I did, I wouldn't be proposing it.)

I don't want to start a flamewar here, because personally I don't
even like MySQL and would prefer always to use PostgreSQL. But it
makes it a lot harder to do so when people keep insisting that
import speed is not important. Rather than say that, why don't we
just admit that PosgreSQL is a fairly crap performer in this regard
at the moment (at least the way I'm doing it), and work out ways
to fix this?

> Just wait until your MySQL server crashes and your client finds that half
> his data is corrupted...

If there are no updates, why would anything be corrupted? At any
rate, I can always restore from backup, since little or nothing
would be lost.

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2002-04-15 07:50:23 Re: [PATCHES] unknownin/out patch (was PQescapeBytea is
Previous Message Michael Loftis 2002-04-15 07:35:32 Re: ANSI Compliant Inserts