Re: Importing Large Amounts of Data

From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: "Curt Sampson" <cjs(at)cynic(dot)net>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Importing Large Amounts of Data
Date: 2002-04-15 07:53:51
Message-ID: GNELIHDDFBOCMGBFGEFOMECDCCAA.chriskl@familyhealth.com.au
Lists: pgsql-hackers

> As you can see from the schema I gave later in my message, that's
> exactly what I did. But does this actually avoid allocating the
> space in the on-disk tuples? What part of the code deals with this?
> It looks to me like the four bytes for the OID are still allocated
> in the tuple, but not used.

OK, well I guess in that case they are - I'm no expert on the file format.

> But from the looks of it, the production system will be doing daily
> imports of fresh data ranging in size from a couple of million rows
> to a couple of tens of millions of rows.

Well that definitely makes a difference then...

> > It's a bit hard to say "just turn off all the things that ensure
> > your data integrity so it runs a bit faster", if you actually need
> > data integrity.
>
> I'm not looking for "runs a bit faster;" five percent either way
> makes little difference to me. I'm looking for a five-fold performance
> increase.

> Anyway, from the looks of it, this is going to be fairly simple
> stuff. (Unfortunately, I don't have details of the real application
> the client has in mind, though I sure wish I did.) What I'm trying
> to indicate when I say "OLAP" is that it's basically selecting
> across broad swaths of a large data set, and doing little or nothing
> in the way of updates. (Except for the daily batches of data, of
> course.)

OK, well now it depends on what kind of selects you're doing. Do you
regularly select over a certain subset of the data? If so, partial
indices might give you a significant speedup. Do you select on
functions of columns? If so, you'll need functional indices. MySQL has
neither of these. However, if you're always doing full table scans,
then MySQL will probably do those faster.
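
For instance (table and column names here are just made up for
illustration):

-- partial index: covers only the subset you actually query
CREATE INDEX orders_open_idx ON orders (customer_id)
    WHERE status = 'open';

-- functional index: lets the planner use an index for
-- WHERE lower(email) = '...'
CREATE INDEX customers_lower_email_idx ON customers (lower(email));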

Now, here's another scenario. Suppose you're often querying aggregate
data over particular subsets of the data. Instead of recomputing the
aggregates every time, you can set up triggers to maintain them for
you on the fly. That gives you O(1) selects instead of O(n). (MySQL's
new query cache might get you some of the same effect, though.)
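
A rough sketch of what I mean, with made-up tables and columns
(assumes plpgsql is installed in the database):

CREATE TABLE sales_summary (region text PRIMARY KEY, total numeric);

CREATE FUNCTION maintain_sales_summary() RETURNS opaque AS '
BEGIN
    -- assumes the summary row for this region already exists
    UPDATE sales_summary
        SET total = total + NEW.amount
        WHERE region = NEW.region;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER sales_summary_trig AFTER INSERT ON sales
    FOR EACH ROW EXECUTE PROCEDURE maintain_sales_summary();

Then the query is just "SELECT total FROM sales_summary WHERE region =
...", no matter how big the sales table gets.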

> I don't want to start a flamewar here, because personally I don't
> even like MySQL and would prefer always to use PostgreSQL. But it
> makes it a lot harder to do so when people keep insisting that
> import speed is not important. Rather than say that, why don't we
> just admit that PostgreSQL is a fairly crap performer in this regard
> at the moment (at least the way I'm doing it), and work out ways
> to fix this?

It depends on your definition. You have to accept a certain overhead
if you want data integrity and MVCC; if you're not willing to pay that
overhead, you don't get the integrity guarantees.

BTW, instead of:

CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);

do:

ALTER TABLE bigone ADD PRIMARY KEY(rec_no);

And remember to run "VACUUM ANALYZE bigone;" or just "ANALYZE bigone;" after
the COPY and before trying to use the table. I'm not sure if it's better to
analyze before or after the indexes are added, but it's definitely better to
vacuum before the indexes are added.
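
So, roughly, something like this (the path is just an example, and as
I said you may want to move the ANALYZE until after the index build):

COPY bigone FROM '/tmp/bigone.dat';           -- bulk load, no indexes yet
VACUUM ANALYZE bigone;
ALTER TABLE bigone ADD PRIMARY KEY (rec_no);  -- builds the unique index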

Chris
