Re: update 600000 rows

From: Loïc Marteau <okparanoid(at)free(dot)fr>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: update 600000 rows
Date: 2007-12-15 11:43:10
Lists: pgsql-performance
Steve Crawford wrote:
> okparanoid(at)free(dot)fr wrote:
>> Hello
>> I have a python script that updates 600,000 rows in one table from a
>> csv file in my postgres database, and it takes 5 hours to complete the
>> transaction...
> Let's see if I guessed correctly.
> Your Python script is stepping through a 600,000 row file and updating 
> information in a table (of unknown rows/columns) by making 600,000 
> individual updates all wrapped in a big transaction. If correct, that 
> means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this 
> is correct, I'd first investigate simply loading the csv data into a 
> temporary table, creating appropriate indexes, and running a single 
> query to update your other table.

I can try this. The problem is that I have to do an insert when the
update does not affect any row (because the row does not exist yet). The
number of rows affected by inserts is small compared to the number of
updated rows (it was 0 when I tested my script). With a temporary table
I could update all the existing rows first, then insert the rows that
are in the temporary table but not in the production table with a 'not
in' statement, as sketched below. Is this a correct way to do it?
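
Something like this minimal sketch is what I have in mind (the table and
column names prod/staging/id/val and the file path are just hypothetical
placeholders):

    -- load the csv into a temporary staging table
    CREATE TEMP TABLE staging (id integer, val text);
    COPY staging FROM '/path/to/data.csv' WITH CSV;  -- or \copy from psql if the file is on the client
    CREATE INDEX staging_id_idx ON staging (id);
    ANALYZE staging;

    -- one set-based update instead of 600,000 individual ones
    UPDATE prod
       SET val = s.val
      FROM staging s
     WHERE prod.id = s.id;

    -- then insert the rows that do not exist yet in the production table
    INSERT INTO prod (id, val)
    SELECT id, val
      FROM staging
     WHERE id NOT IN (SELECT id FROM prod);
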
>> First, when I run htop I see that the memory used is never more than
>> 150 MB.
>> I don't understand, in this case, why setting the shmall and shmmax
>> kernel parameters to 16 GB of memory (the server has 32 GB) speeds up
>> the transaction a lot compared to shmall and shmmax of (only) 2 GB ?!
> Are you saying that you did this and the performance improved or you 
> are wondering if it would?
Yes, I did this and the performance improved. I don't understand why.
Sorry for my poor English...

>> The script runs in a single transaction and pauses from time to time
>> to give postgres time to write the data to disk.
> This doesn't make sense. If the transaction completes successfully 
> then PostgreSQL has committed the data to disk (unless you have done 
> something non-standard and not recommended like turning off fsync). If 
> you are adding pauses between updates, don't do that - it will only 
> slow you down. If the full transaction doesn't complete, all updates 
> will be thrown away anyway and if it does complete then they were 
> committed.

Sorry, the pause is not caused by the python script but by postgres
itself. It does an average of about 3000 updates and then pauses for
2 minutes (htop tells me that postgres is in a writing state, but I
don't really know whether it is doing I/O writes). What I mean is: if
it writes things to disk during the transaction, I don't understand why ?!
>> If the data were written at the end of the transaction, would the
>> performance be better? In production I want the data written to disk
>> regularly to prevent loss of data, but is there any point in writing
>> temporary data to disk in the middle of a transaction ???
> See above. Actual disk IO is handled by the server. PostgreSQL is good 
> at the "D" in ACID. If your transaction completes, the data has been 
> written to disk. Guaranteed.
> Cheers,
> Steve
What I am trying to say is that in "normal" use (not when I run this
maintenance script) I want to be sure that every insert/update request
is written to disk. They are small (1, 2 or 3 rows affected) but there
are a lot of them, issued by many users. However, just for this
maintenance script, is there perhaps some other tweak I can apply to
reduce the I/O stress during the transaction ?!


