Skip site navigation (1) Skip section navigation (2)

Re: Improve BULK insertion

From: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
To: <grupos(at)carvalhaes(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Improve BULK insertion
Date: 2004-12-05 22:48:28
Message-ID: 71E37EF6B7DCC1499CEA0316A256832801D4BBDF@loki.wc.globexplorer.net (view raw or flat)
Thread:
Lists: pgsql-performance
Rodrigo --

You should definitely drop the indexes and any other FK constraints before loading and then rebuild them. Check your logs and see if there are warnings about checkpoint intervals -- only 3 logs seems like it might be small; if you have the disk space I would definitely consider raising the number. If you haven't already posted your config settings you might do so -- this seems very slow. I regularly use COPY to load or unload data sets in the 200k-900k range and they don't take 90 minutes, even on slower hardware (and usually only a few minutes on our production servers; rebuilding the indexes usually takes longer.  

This unloading a 300k+ row data set on a dell linux box with not very good disks and 1 gig of RAM:

Starting copy of parcel staging table parcels_12031 at Thu Dec  2 01:13:52 2004
Done with staging table copy at Thu Dec  2 01:15:16 2004
...
Starting compression of parcel file at Thu Dec  2 01:15:22 2004
gzip: /tmp/parcels_12031.unl.gz already exists; do you wish to overwrite (y or n
)? y
Done with compression of parcel file at Thu Dec  2 01:17:23 2004
...

And loading them on a rather faster server:

Starting unzip of parcels at Thu Dec  2 01:29:15 2004
Finished with unzip at Thu Dec  2 01:29:22 2004
...
Target db detail table updated at Thu Dec  2 01:29:29 2004
Dropping indexes
Dropping fk constraint on tracking id
Dropping indexes
Done dropping indexes on target parcels table at Thu Dec  2 01:29:30 2004
NOTICE:  drop cascades to table f12031.parcel_pins
NOTICE:  drop cascades to table f12031.parcel_addresses
NOTICE:  drop cascades to table f12031.parcel_owner_fti
NOTICE:  drop cascades to table f12031.parcel_owners
Removing old parcels entries starting at Thu Dec  2 01:29:30 2004
Done deleting schema and parcels for track_id 10163541 at Thu Dec  2 01:33:04 2004
Starting load of parcels at Thu Dec  2 01:33:04 2004
Done copying data into parcels at Thu Dec  2 01:35:18 2004
Deleting old v_detail reference for track_id 10163541
Done with delete of old v_detail reference
Starting creation of foreign key constraint at Thu Dec  2 01:39:43 2004
Done with creation of foreign key constraint at Thu Dec  2 01:42:14 2004
Starting spatial index create at Thu Dec  2 01:42:14 2004
Done creating spatial index at Thu Dec  2 01:55:04 2004
Starting stats on geometry column now
Done doing stats for spatial index at Thu Dec  2 02:03:47 2004
Starting index on PIN now
Done creating pin index at Thu Dec  2 02:09:36 2004
Starting index on tracking id now
Done creating trid index at Thu Dec  2 02:12:35 2004
Starting centroid index now
Done creating centroid index at Thu Dec  2 02:24:11 2004
Starting stats on centroid column
Done doing stats for spatial index at Thu Dec  2 02:29:55 2004
Doing City/Street Index on parcels table ...Done creating city/street index at Thu Dec  2 02:42:41 2004 with result <-1>
Committing changes

So this took about 70 minutes to delete 200000+ rows from a table with about 5 million rows, load a new set and reindex them (and do some statistics for spatial geometry). If the table had only this data the indexing would have been *much* faster. These are moderate size columns -- about 2 dozen columns and some spatial data (polygon and point). Both servers have rather more log files than your setup, but I am not familiar enough with postgres to know how much of an impact that alone will have. The comment about it slowing down part way through a load makes me suspect indexing issues, somehow (not from postgres experience but it rings a bell with other DBs); if you explicitly drop the indexes first and then load does it show the same performance behavior ?

If you are doing the data read from, the database write and the WAL logging all on single disk drive, then I would guess that that is your bottleneck. If you use vmstat and/or top or the like, is your I/O pegged ?

HTH

Greg WIlliamson
DBA
GlobeXplorer LLC

-----Original Message-----
From:	Rodrigo Carvalhaes [mailto:grupos(at)carvalhaes(dot)net]
Sent:	Sun 12/5/2004 11:52 AM
To:	Christopher Browne
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [PERFORM] Improve BULK insertion
Hi!

1. I am doing the inserts using pg_restore. The dump was created using 
pg_dump and the standard format (copy statements)
2. See below the table schema. There are only 7 indexes.  
3. My transaction log configuration are : checkpoint_segments = 3  and 
checkpoint_timeout = 300 and my transaction logs are on the same disk .

I know that I can increase the performance separating the transaction 
logs and making a RAID 5 array BUT I am really curious about WHY this 
performance is so poor and HOW can I try to improve on this actual 
machine because actualy this inserts are taking around 90 minutes!!!

Cheers!

Rodrigo

dadosadv=# \d si2010
                                        Table "public.si2010"
   Column   |       Type       |                              Modifiers
------------+------------------+---------------------------------------------------------------------
 i2_filial  | character(2)     | not null default '  '::bpchar
 i2_num     | character(10)    | not null default '          '::bpchar
 i2_linha   | character(2)     | not null default '  '::bpchar
 i2_data    | character(8)     | not null default '        '::bpchar
 i2_dc      | character(1)     | not null default ' '::bpchar
 i2_debito  | character(20)    | not null default '                    
'::bpchar
 i2_dcd     | character(1)     | not null default ' '::bpchar
 i2_credito | character(20)    | not null default '                    
'::bpchar
 i2_dcc     | character(1)     | not null default ' '::bpchar
 i2_moedas  | character(5)     | not null default '     '::bpchar
 i2_valor   | double precision | not null default 0.0
 i2_hp      | character(3)     | not null default '   '::bpchar
 i2_hist    | character(40)    | not null default 
'                                        '::bpchar
 i2_ccd     | character(9)     | not null default '         '::bpchar
 i2_ccc     | character(9)     | not null default '         '::bpchar
 i2_ativdeb | character(6)     | not null default '      '::bpchar
 i2_ativcrd | character(6)     | not null default '      '::bpchar
 i2_vlmoed2 | double precision | not null default 0.0
 i2_vlmoed3 | double precision | not null default 0.0
 i2_vlmoed4 | double precision | not null default 0.0
 i2_vlmoed5 | double precision | not null default 0.0
 i2_dtvenc  | character(8)     | not null default '        '::bpchar
 i2_criter  | character(4)     | not null default '    '::bpchar
 i2_rotina  | character(8)     | not null default '        '::bpchar
 i2_periodo | character(6)     | not null default '      '::bpchar
 i2_listado | character(1)     | not null default ' '::bpchar
 i2_origem  | character(40)    | not null default 
'                                        '::bpchar
 i2_permat  | character(4)     | not null default '    '::bpchar
 i2_filorig | character(2)     | not null default '  '::bpchar
 i2_intercp | character(1)     | not null default ' '::bpchar
 i2_identcp | character(12)    | not null default '            '::bpchar
 i2_lote    | character(4)     | not null default '    '::bpchar
 i2_doc     | character(6)     | not null default '      '::bpchar
 i2_emporig | character(2)     | not null default '  '::bpchar
 i2_lp      | character(3)     | not null default '   '::bpchar
 i2_itemd   | character(9)     | not null default '         '::bpchar
 i2_itemc   | character(9)     | not null default '         '::bpchar
 i2_prelan  | character(1)     | not null default ' '::bpchar
 i2_tipo    | character(2)     | not null default '  '::bpchar
 i2_dcc     | character(1)     | not null default ' '::bpchar
 i2_moedas  | character(5)     | not null default '     '::bpchar
 i2_valor   | double precision | not null default 0.0
 i2_hp      | character(3)     | not null default '   '::bpchar
 i2_hist    | character(40)    | not null default 
'                                        '::bpchar
 i2_ccd     | character(9)     | not null default '         '::bpchar
 i2_ccc     | character(9)     | not null default '         '::bpchar
 i2_ativdeb | character(6)     | not null default '      '::bpchar
 i2_ativcrd | character(6)     | not null default '      '::bpchar
 i2_vlmoed2 | double precision | not null default 0.0
 i2_vlmoed3 | double precision | not null default 0.0
 i2_vlmoed4 | double precision | not null default 0.0
 i2_vlmoed5 | double precision | not null default 0.0
 i2_dtvenc  | character(8)     | not null default '        '::bpchar
 i2_criter  | character(4)     | not null default '    '::bpchar
 i2_rotina  | character(8)     | not null default '        '::bpchar
 i2_periodo | character(6)     | not null default '      '::bpchar
 i2_listado | character(1)     | not null default ' '::bpchar
 i2_origem  | character(40)    | not null default 
'                                        '::bpchar
 i2_permat  | character(4)     | not null default '    '::bpchar
 i2_filorig | character(2)     | not null default '  '::bpchar
 i2_intercp | character(1)     | not null default ' '::bpchar
 i2_identcp | character(12)    | not null default '            '::bpchar
 i2_lote    | character(4)     | not null default '    '::bpchar
 i2_doc     | character(6)     | not null default '      '::bpchar
 i2_emporig | character(2)     | not null default '  '::bpchar
 i2_lp      | character(3)     | not null default '   '::bpchar
 i2_itemd   | character(9)     | not null default '         '::bpchar
 i2_itemc   | character(9)     | not null default '         '::bpchar
 i2_prelan  | character(1)     | not null default ' '::bpchar
 i2_tipo    | character(2)     | not null default '  '::bpchar
 d_e_l_e_t_ | character(1)     | not null default ' '::bpchar
 r_e_c_n_o_ | double precision | not null default 0.0
Indexes:
    "si2010_pkey" primary key, btree (r_e_c_n_o_)
    "si20101" btree (i2_filial, i2_num, i2_linha, i2_periodo, 
r_e_c_n_o_, d_e_l_e_t_)
    "si20102" btree (i2_filial, i2_periodo, i2_num, i2_linha, 
r_e_c_n_o_, d_e_l_e_t_)
    "si20103" btree (i2_filial, i2_data, i2_num, i2_linha, r_e_c_n_o_, 
d_e_l_e_t_)
    "si20104" btree (i2_filial, i2_debito, i2_data, i2_num, i2_linha, 
r_e_c_n_o_, d_e_l_e_t_)
    "si20105" btree (i2_filial, i2_credito, i2_data, i2_num, i2_linha, 
r_e_c_n_o_, d_e_l_e_t_)
    "si20106" btree (i2_filial, i2_doc, i2_periodo, r_e_c_n_o_, d_e_l_e_t_)
    "si20107" btree (i2_filial, i2_origem, r_e_c_n_o_, d_e_l_e_t_)


Christopher Browne wrote:

>In the last exciting episode, grupos(at)carvalhaes(dot)net (Grupos) wrote:
>  
>
>>Hi !
>>
>>I need to insert 500.000 records on a table frequently. It´s a bulk
>>insertion from my applicatoin.
>>I am with a very poor performance. PostgreSQL insert very fast until
>>the tuple 200.000 and after it the insertion starts to be really slow.
>>I am seeing on the log and there is a lot of transaction logs,
>>something like :
>>
>>2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000012"
>>2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000013"
>>2004-12-04 11:08:59 LOG:  recycled transaction log file "0000000600000011"
>>2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000015"
>>2004-12-04 11:14:04 LOG:  recycled transaction log file "0000000600000014"
>>2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000016"
>>2004-12-04 11:19:08 LOG:  recycled transaction log file "0000000600000017"
>>2004-12-04 11:24:10 LOG:  recycled transaction log file "0000000600000018"
>>    
>>
>
>It is entirely normal for there to be a lot of transaction log file
>recycling when bulk inserts are taking place; that goes through a lot
>of transaction logs.
>
>  
>
>>How can I configure PostgreSQL to have a better performance on this
>>bulk insertions ? I already increased the memory values.
>>    
>>
>
>Memory is, as likely as not, NOT the issue.
>
>Two questions:
>
> 1.  How are you doing the inserts?  Via INSERT statements?  Or
>     via COPY statements?  What sort of transaction grouping
>     is involved?
>
>     COPY is way faster than INSERT, and grouping plenty of updates
>     into a single transaction is generally a "win."
>
> 2.  What is the schema like?  Does the table have a foreign key
>     constraint?  Does it have a bunch of indices?
>
>     If there should eventually be lots of indices, it tends to be
>     faster to create the table with none/minimal indices, and add
>     indexes afterwards, as long as your "load" process can be trusted
>     to not break "unique" constraints...
>
>     If there is some secondary table with a foreign key constraint,
>     and _that_ table is growing, it is possible that a sequential
>     scan is being used to search the secondary table where, if you
>     did an ANALYZE on that table, an index scan would be preferred
>     once it grew to larger size...
>
>There isn't a particular reason for PostgreSQL to "hit a wall" upon
>seeing 200K records; I and coworkers routinely load database dumps
>that have millions of (sometimes pretty fat) records, and they don't
>"choke."  That's true whether talking about loading things onto my
>(somewhat wimpy) desktop PC, or a SMP Xeon system with a small RAID
>array, or higher end stuff involving high end SMP and EMC disk arrays.
>The latter obviously being orders of magnitude faster than desktop
>equipment :-).
>  
>

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)




pgsql-performance by date

Next:From: Josh BerkusDate: 2004-12-05 23:06:40
Subject: Re: Partitioned table performance
Previous:From: Rodrigo CarvalhaesDate: 2004-12-05 19:52:03
Subject: Re: Improve BULK insertion

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group