Re: COPY insert performance

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Chris Isaacson <cisaacson(at)tradebotsystems(dot)com>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: COPY insert performance
Date: 2005-07-25 23:08:33
Message-ID: 42E570F1.7090500@arbash-meinel.com
Lists: pgsql-performance

Chris Isaacson wrote:
> I need COPY via libpqxx to insert millions of rows into two tables. One
> table has roughly half as many rows and requires half the storage. In
> production, the largest table will grow by ~30M rows/day. To test the
> COPY performance I split my transactions into 10,000 rows. I insert
> roughly 5000 rows into table A for every 10,000 rows into table B.
>
> Table A has one unique index:
>
> "order_main_pk" UNIQUE, btree (cl_ord_id)
>
> Table B has 1 unique index and 2 non-unique indexes:
>
> "order_transition_pk" UNIQUE, btree (collating_seq)
> "order_transition_ak2" btree (orig_cl_ord_id)
> "order_transition_ak3" btree (exec_id)

Do you have any foreign key references?
If you are creating a table for the first time (or loading a large
fraction of the data), it is common to drop the indexes and foreign keys
first, do the insert/COPY, and then recreate them afterwards.
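
Something like the following, using the index definitions quoted above.
(The table name order_transition is my guess from the index names, and I'd
keep the unique/primary-key indexes in place to preserve integrity checks.)

```sql
-- Sketch only: drop the non-unique indexes before the bulk COPY, then
-- recreate them once the load finishes. Each CREATE INDEX is one big
-- sort (using maintenance_work_mem) instead of millions of incremental
-- btree updates during the load.
DROP INDEX order_transition_ak2;
DROP INDEX order_transition_ak3;

-- ... run the COPY batches here ...

CREATE INDEX order_transition_ak2 ON order_transition (orig_cl_ord_id);
CREATE INDEX order_transition_ak3 ON order_transition (exec_id);
```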

InnoDB is the MySQL backend with referential integrity and true
transaction support, correct? I believe the default backend (MyISAM)
supports neither, so it "cheats" to give you speed. That may be just fine
for your needs, especially since you are willing to run with fsync = false.

I think moving pg_xlog to a dedicated drive (or set of drives) could help
your performance, as could increasing checkpoint_segments.
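
For instance (your config below already has checkpoint_segments = 40; the
values here are illustrative assumptions, not tuned recommendations):

```
# postgresql.conf sketch -- 8.0-era settings, illustrative values only
checkpoint_segments = 128   # fewer forced checkpoints during a sustained bulk load
checkpoint_timeout = 900    # seconds; spread out timed checkpoints as well
```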

I don't know that you gain much by changing the bgwriter settings; if you
are streaming everything in at once, you probably want it written out
right away. My understanding is that the bgwriter settings matter most
when you have mixed reads and writes going on at the same time and you
want to make sure the reads have time to execute (i.e. the writes are not
saturating your I/O).

Also, is any of this tested under load, with a separate process issuing
queries while you are loading in data? Traditionally MySQL is faster with
a single process inserting/querying data, but once you have multiple
processes hitting it at the same time, its performance degrades much
faster than postgres's.

You also seem to be giving MySQL 512M of RAM to work with, while only
giving postgres 2M of work_mem and 200M of maintenance_work_mem.
(Re)creating indexes uses maintenance_work_mem, but updating indexes
during the COPY could easily be limited to work_mem. You may be RAM
starved.
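
The 2M/200M figures come straight from the config quoted below; in 8.0
both settings are specified in KB (the variable names here are just mine):

```python
# Convert the postgresql.conf memory settings quoted below to MB.
# work_mem and maintenance_work_mem are both specified in KB in 8.0.
work_mem_kb = 2048
maintenance_work_mem_kb = 204800

work_mem_mb = work_mem_kb // 1024
maintenance_work_mem_mb = maintenance_work_mem_kb // 1024

print(work_mem_mb)              # 2   -- MB available per sort / index update
print(maintenance_work_mem_mb)  # 200 -- MB available for CREATE INDEX etc.
```

Compare that 2MB to the innodb_buffer_pool_size = 512M in the my.ini below.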

John
=:->

>
> My testing environment is as follows:
> -Postgresql 8.0.1
> -libpqxx 2.5.0
> -Linux 2.6.11.4-21.7-smp x86_64
> -Dual Opteron 246
> -System disk (postgres data resides on this SCSI disk) - Seagate
> (ST373453LC) - 15K, 73 GB
> (http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,549,00.html)
> -2nd logical disk - 10K, 36GB IBM SCSI (IC35L036UCDY10-0) - WAL reside
> on this disk
> -NO RAID
>
> *PostgreSQL*
> Here are the results of copying in 10M rows as fast as possible:
> (10K/transaction)
> Total Time: 1129.556 s
> Rows/sec: 9899.922
> Transaction>1.2s 225
> Transaction>1.5s 77
> Transaction>2.0s 4
> Max Transaction 2.325s
>
> **MySQL**
> **I ran a similar test with MySQL 4.1.10a (InnoDB) which produced these
> results: (I used MySQL's INSERT INTO x VALUES
> (1,2,3),(4,5,6),(...,...,...) syntax) (10K/transaction)
> Total Time: 860.000 s
> Rows/sec: 11627.91
> Transaction>1.2s 0
> Transaction>1.5s 0
> Transaction>2.0s 0
> Max Transaction 1.175s
>
> Considering the configurations shown below, can anyone offer advice to
> close the 15% gap and reduce the much worse variability I'm experiencing? Thanks
>
> My *postgresql.conf* has the following non-default values:
> # -----------------------------
> # PostgreSQL configuration file
> # -----------------------------
> listen_addresses = '*' # what IP interface(s) to listen on;
> max_connections = 100
> #---------------------------------------------------------------------------
> # RESOURCE USAGE (except WAL)
> #---------------------------------------------------------------------------
> shared_buffers = 65536 # min 16, at least max_connections*2, 8KB each
> work_mem = 2048 # min 64, size in KB
> maintenance_work_mem = 204800 # min 1024, size in KB
> max_fsm_pages = 2250000 # min max_fsm_relations*16, 6 bytes each
> bgwriter_delay = 200 # 10-10000 milliseconds between rounds
> bgwriter_percent = 10 # 0-100% of dirty buffers in each round
> bgwriter_maxpages = 1000 # 0-1000 buffers max per round
> #---------------------------------------------------------------------------
> # WRITE AHEAD LOG
> #---------------------------------------------------------------------------
> fsync = false # turns forced synchronization on or off
> wal_buffers = 64 # min 4, 8KB each
> checkpoint_segments = 40 # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 600 # range 30-3600, in seconds
> #---------------------------------------------------------------------------
> # QUERY TUNING
> #---------------------------------------------------------------------------
> effective_cache_size = 65536 # typically 8KB each
> random_page_cost = 2 # units are one sequential page fetch cost
> #---------------------------------------------------------------------------
> # ERROR REPORTING AND LOGGING
> #---------------------------------------------------------------------------
>
> log_min_duration_statement = 250 # -1 is disabled, in milliseconds.
> log_connections = true
> log_disconnections = true
> log_duration = true
> log_line_prefix = '<%r%u%p%t%d%%' # e.g. '<%u%%%d> '
> # %u=user name %d=database name
> # %r=remote host and port
> # %p=PID %t=timestamp %i=command tag
> # %c=session id %l=session line number
> # %s=session start timestamp %x=transaction id
> # %q=stop here in non-session processes
> # %%='%'
> log_statement = 'none' # none, mod, ddl, all
> #---------------------------------------------------------------------------
> # RUNTIME STATISTICS
> #---------------------------------------------------------------------------
> # - Query/Index Statistics Collector -
> stats_start_collector = true
> stats_command_string = true
> stats_block_level = true
> stats_row_level = true
> stats_reset_on_server_start = true
>
> My MySQL *my.ini* has the following non default values:
> innodb_data_home_dir = /var/lib/mysql/
> innodb_data_file_path = ibdata1:10M:autoextend
> innodb_log_group_home_dir = /var/lib/mysql/
> innodb_log_arch_dir = /var/lib/mysql/
> # You can set .._buffer_pool_size up to 50 - 80 %
> # of RAM but beware of setting memory usage too high
> innodb_buffer_pool_size = 512M
> innodb_additional_mem_pool_size = 64M
> # Set .._log_file_size to 25 % of buffer pool size
> innodb_log_file_size = 128M
> innodb_log_buffer_size = 64M
> innodb_flush_log_at_trx_commit = 1
> innodb_lock_wait_timeout = 50
> innodb_flush_method = O_DSYNC
> max_allowed_packet = 16M
>
>
>
