Re: INSERT performance deteriorates quickly during a large import

From: Tomáš Vondra <tv(at)fuzzy(dot)cz>
To: "Krasimir Hristozov (InterMedia Ltd)" <krasi(at)imedia-dev(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: INSERT performance deteriorates quickly during a large import
Date: 2007-11-08 17:19:04
Message-ID: 47334508.1000305@fuzzy.cz
Lists: pgsql-general

Try one of these:

a) don't use INSERT statements, use a COPY instead

b) run ANALYZE on the destination table from time to time (say, one
ANALYZE per 1000 inserts)

c) create the table without constraints (primary / foreign keys in this
case), import all the data, and then create the constraints

Options (b) and (c) may be combined, i.e. import without the constraints
and analyze from time to time. I'd probably try (a) first, anyway - see
the sketch below.
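
Just to illustrate, here's a rough sketch of (a), (b) and (c) combined,
based on the DDL quoted below. The CSV path and the export step are made
up, and the sequence name assumes the default naming for a SERIAL column;
also note that a server-side COPY needs superuser rights (\copy from psql
or COPY FROM STDIN from the client are the alternatives):

-- drop the constraints before loading (c)
ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_threads_fk;
ALTER TABLE sp_thread_replies DROP CONSTRAINT sp_thread_replies_users_fk;

-- load the data exported from MySQL in one COPY instead of ~1,300,000 INSERTs (a)
COPY sp_thread_replies (id, thread_id, body, ts_added, user_id, thread_offset,
                        approved, title, deleted, edit_reason, edit_user_id,
                        edit_time)
    FROM '/tmp/sp_thread_replies.csv' WITH CSV;

-- keep the SERIAL sequence in sync, as COPY bypasses the column default
SELECT setval('sp_thread_replies_id_seq',
              (SELECT max(id) FROM sp_thread_replies));

-- refresh planner statistics after the load (b)
ANALYZE sp_thread_replies;

-- re-create the constraints once all the data is in (c)
ALTER TABLE sp_thread_replies
    ADD CONSTRAINT sp_thread_replies_threads_fk FOREIGN KEY (thread_id)
        REFERENCES sp_threads (id) ON DELETE CASCADE;
ALTER TABLE sp_thread_replies
    ADD CONSTRAINT sp_thread_replies_users_fk FOREIGN KEY (user_id)
        REFERENCES sp_users (id);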

Try to gather some more statistics - is the problem related to CPU or
I/O? Use 'dstat', for example - this might give you a hint in case the
advice mentioned above doesn't help.

Tomas

> We need to import data from a relatively large MySQL database into an
> existing PostgreSQL database, using a PHP5 script that SELECTs from MySQL
> and INSERTs in PostgreSQL. A part of the import involves moving about
> 1,300,000 records from one MySQL table to one of our PostgreSQL tables. The
> problem is that the insert performance inevitably deteriorates as the
> number
> of inserts increases.
>
> We tried different approaches:
>
> * selecting only parts of the source table data based on a certain
> condition
> * selecting all of the source data
>
> coupled with either of these:
>
> * inserting without explicit transactions
> * inserting all the data in a single transaction
> * inserting the data in partial transactions of about 100, 1000, 10000,
> 100000 inserts each
>
> While there were performance benefits in some of the cases (selecting all
> the data and inserting in transaction packets of about 1000 each being the
> fastest), the problem was that it still deteriorated as the import
> progressed.
>
> We tried removing all foreign keys and indices from the Postgres table;
> this gained some performance, but it deteriorated as well.
>
> The latest (and best performing) test we did was under the following
> conditions:
>
> * 11851 pre-existing records in the destination table
> * the table was vacuumed just before the import
> * all foreign keys and indices were removed from the destination table
> * selected all of the data from the source table at once
> * inserted in transactions of 1000 inserts each
>
> We displayed time statistics for each 100 inserts. The process started at
> about 1 second per 100 inserts, which extrapolated to about 4 hours for the
> entire process. 14 hours later it had imported about a quarter of the data
> (a bit more than 330000 records), and 100 inserts now took nearly 40
> seconds.
>
> We tested reading from MySQL alone, without inserting the data in Postgres.
> All records were read in about a minute and a half, so MySQL performance is
> not a part of the issue. The PHP script selects the MySQL data, fetches
> rows
> sequentially, occasionally performs a couple of selects against PostgreSQL
> data (which is cached in a PHP array to reduce the DB operations; no more
> than 80000 array elements, integer keys, integer data), and inserts into
> PostgreSQL. The algorithm seems to be linear in nature, and the performance
> deterioration most probably isn't caused by the PHP code.
>
> Has anyone had an issue like this, and perhaps a suggestion for a possible
> cause and solution? Is it common for PostgreSQL to grow so slow as the
> amount of data in the tables increases? If so, is it just the insert
> operation or all kinds of queries? Isn't 300000 records too low a threshold
> for such performance deterioration?
>
> Here are some technical details, that might be helpful:
>
> * PHP, MySQL and PostgreSQL all work on the same server, sharing the same
> memory and hard drive.
> * the server runs FreeBSD 5.3-RELEASE on an AMD Athlon(tm) 64 Processor
> 3000+ (2GHz K8 class CPU) with 1GB RAM
> * the software versions installed are Apache 1.3.37, PHP 5.2.1, MySQL
> 4.1.22, PostgreSQL 8.1.8
> * postgresql.conf variables other than defaults are: max_connections = 40,
> shared_buffers = 1000 (this is the default)
> * we have also tried these on another server with Red Hat Enterprise Linux
> ES release 4 (Linux 2.6.9-42.0.3.ELsmp) on 2xDual Core AMD Opteron(tm)
> Processor 270 (4x2GHz logical CPUs) with 2GB RAM
> * both servers run in x86_64 mode, PostgreSQL footprint in memory stays
> relatively small, CPU usage maxes out on import, there is no resource
> starvation in any way
>
> DDL statement for the creation of the PostgreSQL table in question:
>
> CREATE TABLE "public"."sp_thread_replies" (
> "id" SERIAL,
> "thread_id" INTEGER NOT NULL,
> "body" TEXT NOT NULL,
> "ts_added" INTEGER DEFAULT 0 NOT NULL,
> "user_id" INTEGER NOT NULL,
> "thread_offset" INTEGER DEFAULT 0,
> "approved" SMALLINT DEFAULT 1,
> "title" TEXT,
> "deleted" SMALLINT DEFAULT 0,
> "edit_reason" VARCHAR(255),
> "edit_user_id" INTEGER,
> "edit_time" INTEGER,
> CONSTRAINT "sp_thread_replies_pkey" PRIMARY KEY("id"),
> CONSTRAINT "sp_thread_replies_threads_fk" FOREIGN KEY ("thread_id")
> REFERENCES "public"."sp_threads"("id")
> ON DELETE CASCADE
> ON UPDATE NO ACTION
> NOT DEFERRABLE,
> CONSTRAINT "sp_thread_replies_users_fk" FOREIGN KEY ("user_id")
> REFERENCES "public"."sp_users"("id")
> ON DELETE NO ACTION
> ON UPDATE NO ACTION
> NOT DEFERRABLE
> ) WITH OIDS;
>
> The table is a part of a custom forum engine. It stores all thread posts.
> It's most often queried with SELECTs and INSERTs, less often with UPDATEs,
> and records are deleted quite seldom in normal operation of the application
> (though we may delete records manually from the console from time to time).
>
