Big question on insert performance/using COPY FROM

From: "Morgan Kita" <mkita(at)verseon(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Big question on insert performance/using COPY FROM
Date: 2005-08-31 19:17:38
Message-ID: 08B420FF5BF7BC42A064212C2EB768801C10BE@neutron.verseon.com
Lists: pgsql-performance

Hi,

I am currently trying to speed up the insertion of bulk loads to my database. I have fiddled with all of the parameters that I have seen suggested (i.e. checkpoint_segments, checkpoint_timeout, maintenance_work_mem, and shared_buffers) with no success. I even turned off fsync with no effect, so I am pretty sure the biggest problem is that the DB is CPU-limited at the moment because of the rather weak machine that postmaster is running on (Athlon XP 2400+ with 512 MB of RAM). That will change in the future, so for now I am trying to get performance increases that don't involve changing the machine.
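
For reference, this is roughly what I have been touching in postgresql.conf (the values below are placeholders, not my exact settings):

    checkpoint_segments  = 16        # placeholder values throughout
    checkpoint_timeout   = 900       # seconds
    maintenance_work_mem = 262144    # KB
    shared_buffers       = 10000     # 8KB pages, so roughly 80MB on this 512MB box
    fsync = off                      # only as a test; it goes back on afterwards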

I am currently inserting into the database through libpqxx's C++ interface, using prepared statements that perform regular INSERTs. I would like to use COPY FROM, since I have read so much about its performance advantage over INSERT, but I am not sure how to use it in my case. So let me give you an idea of how the tables are laid out.
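
At the SQL level the prepared statements boil down to plain single-row inserts, something like this (a simplified sketch against the example schema described below, not the exact statements):

    PREPARE ins_tb1 (text) AS
        INSERT INTO tb1 (uk1) VALUES ($1);

    EXECUTE ins_tb1 ('some-identifier');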

The real DB has more tables, but let's say for the sake of argument I have 3 tables: TB1, TB2, and TB3. TB1 has a primary key PK1 and a unique identifier column UK1 (type text) that has an index on it. TB2 has a PK2, a UK2 of its own (type text, also indexed), and a foreign key FK2 that points to TB1's PK1. TB3 has a PK3 and a foreign key FK3 that points to TB2's PK2.
TB1              TB2                     TB3
---------------  ----------------------  ---------------
PK1, UK1         PK2, UK2, FK2(PK1)      PK3, FK3(PK2)
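
In DDL terms, a stripped-down version of that layout would look something like this (serial int4 keys and text identifiers assumed; the real tables have more columns):

    CREATE TABLE tb1 (
        pk1  serial PRIMARY KEY,
        uk1  text   NOT NULL UNIQUE      -- the unique constraint provides the index
    );

    CREATE TABLE tb2 (
        pk2  serial  PRIMARY KEY,
        uk2  text    NOT NULL UNIQUE,
        fk2  integer NOT NULL REFERENCES tb1 (pk1)
    );

    CREATE TABLE tb3 (
        pk3  serial  PRIMARY KEY,
        fk3  integer NOT NULL REFERENCES tb2 (pk2)
    );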

Now in libpqxx I am parsing an input list of objects that are then written to these tables. Each object may produce one row in TB1, one row in TB2, and one row in TB3. The UK1 and UK2 identifiers are used to prevent duplicate entries in TB1 and TB2 respectively, and I know COPY FROM obeys these unique checks; my problem is the FKs. Say I try to insert a row into TB1: if it is unique on UK1, a new row is inserted with a newly assigned int4 primary key; if it is a duplicate, no insert is done and the already existing row's primary key is returned instead. That identifier (new or pre-existing) is then used as FK2 when populating TB2's row, so the row to be inserted into TB2 needs the primary key that came out of the attempted insert into TB1. Similarly, the insert into TB3 needs the primary key that came out of the attempted insert into TB2. Once that is done I move on to parsing the next object for insertion into the 3 tables.
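
So per object the logic is roughly the following (a sketch only; in reality it all goes through libpqxx, and the sequence lookup assumes the serial columns from the DDL sketch above):

    -- first look for an existing row
    SELECT pk1 FROM tb1 WHERE uk1 = 'some-identifier';

    -- if that returns nothing, insert and grab the freshly assigned key
    INSERT INTO tb1 (uk1) VALUES ('some-identifier');
    SELECT currval(pg_get_serial_sequence('tb1', 'pk1'));

    -- whichever way pk1 was obtained, it becomes FK2 in the TB2 row,
    -- and TB2's pk2 likewise becomes FK3 in the TB3 row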

So let's say I want to insert a list of objects using COPY FROM... what's the way to do it? How can I, at the very least, get back the list of TB1 primary keys (from newly inserted rows or from already existing ones) after the COPY FROM into TB1, so I can use them for the COPY FROM into TB2, and so on? Is there a better way to do this?
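
The only idea I have come up with so far is to COPY each batch into a temporary staging table and then resolve the keys with set-based statements, roughly like this (an untested sketch; tb1_stage and the file path are made-up names):

    CREATE TEMP TABLE tb1_stage (uk1 text);
    COPY tb1_stage FROM '/path/to/batch.dat';   -- or COPY ... FROM STDIN through the client

    -- insert only the identifiers that are not already in TB1
    INSERT INTO tb1 (uk1)
        SELECT DISTINCT s.uk1
        FROM tb1_stage s
        WHERE NOT EXISTS (SELECT 1 FROM tb1 t WHERE t.uk1 = s.uk1);

    -- then pick up the keys (new or pre-existing) by joining back on UK1
    SELECT t.pk1, s.uk1
    FROM tb1 t JOIN tb1_stage s ON s.uk1 = t.uk1;

But I have no idea whether that actually beats the prepared inserts once the extra joins are factored in, which is really what I am asking.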

P.S. I am going to set up autovacuum for these bulk loads. My question, though, is why VACUUM is useful for bulk loads at all. I understand that it frees up dead rows resulting from UPDATEs and the like, but where do the dead rows come from with plain INSERTs?

Thanks,
Morgan
