Re: update 600000 rows

From: Loïc Marteau <okparanoid(at)free(dot)fr>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: update 600000 rows
Date: 2007-12-15 11:43:10
Message-ID: 4763BDCE.9010705@free.fr
Lists: pgsql-performance

Steve Crawford wrote:
> okparanoid(at)free(dot)fr wrote:
>> Hello
>>
>> I have a Python script that updates 600000 rows in one table from a
>> csv file in my Postgres database, and it takes 5 hours to run the
>> transaction...
>>
>>
> Let's see if I guessed correctly.
>
> Your Python script is stepping through a 600,000 row file and updating
> information in a table (of unknown rows/columns) by making 600,000
> individual updates all wrapped in a big transaction. If correct, that
> means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this
> is correct, I'd first investigate simply loading the csv data into a
> temporary table, creating appropriate indexes, and running a single
> query to update your other table.

I can try this. The problem is that I have to do an insert when the
update does not affect any row (because the row does not exist yet). The
number of rows affected by inserts is small compared to the number of
updated rows (it was 0 when I tested my script). With a temporary table
I could update all the matching rows first, and then insert the rows
that are in the temporary table but not in the production table, using a
'not in' statement. Is this a correct way to do it?
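
Something like the rough sketch below is what I have in mind (the table
name "prod_table", the key column "id" and the other columns are only
assumptions here, my real schema is different):

BEGIN;

-- load the csv into a temporary table with the same structure
-- (server-side COPY needs superuser; from psql one can use \copy,
-- or COPY FROM STDIN from the python script)
CREATE TEMPORARY TABLE csv_import (LIKE prod_table) ON COMMIT DROP;
COPY csv_import FROM '/path/to/file.csv' WITH CSV;

-- index and analyze so the joins below stay cheap
CREATE INDEX csv_import_id_idx ON csv_import (id);
ANALYZE csv_import;

-- one set-based update instead of 600000 individual UPDATE statements
UPDATE prod_table p
   SET col1 = c.col1,
       col2 = c.col2
  FROM csv_import c
 WHERE p.id = c.id;

-- then insert the rows that do not exist yet; NOT EXISTS seems to plan
-- better than NOT IN on large sets
INSERT INTO prod_table (id, col1, col2)
SELECT c.id, c.col1, c.col2
  FROM csv_import c
 WHERE NOT EXISTS (SELECT 1 FROM prod_table p WHERE p.id = c.id);

COMMIT;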
>
>> First, when I run htop I see that the memory used is never more than
>> 150 MB.
>> In this case I don't understand why setting the shmall and shmmax
>> kernel parameters to 16 GB of memory (the server has 32 GB) speeds up
>> the transaction a lot compared to shmall and shmmax of (only) 2 GB ?!
>>
> Are you saying that you did this and the performance improved or you
> are wondering if it would?
>
Yes, I did this and the performance improved. I don't understand why.
Sorry for my poor English...
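
For reference, here is what I plan to check on the server to compare the
kernel limit with what Postgres itself is configured to use (only a
sketch, I am not sure it explains the difference I saw):

SHOW shared_buffers;   -- main part of the shared memory segment asked for at startup
SHOW work_mem;         -- per-sort/per-hash memory, allocated in each backend
SHOW max_connections;  -- multiplies the possible per-backend memory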

>> The script runs in a single transaction and pauses from time to time
>> to give Postgres time to write the data to disk.
>>
> This doesn't make sense. If the transaction completes successfully
> then PostgreSQL has committed the data to disk (unless you have done
> something non-standard and not recommended like turning off fsync). If
> you are adding pauses between updates, don't do that - it will only
> slow you down. If the full transaction doesn't complete, all updates
> will be thrown away anyway and if it does complete then they were
> committed.

Sorry, the pauses are not caused by the Python script but by Postgres
itself. It does on average around 3000 updates and then pauses for 2
minutes (htop tells me that Postgres is in a writing state, but I don't
really know whether it is doing actual I/O writes). What I mean is: if
it writes things to disk during the transaction, I don't understand why ?!
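
In case these pauses come from checkpoints (just a guess on my part),
here is what I plan to look at on the server; the settings below are the
ones I expect to exist on an 8.x install:

SHOW checkpoint_segments;  -- how much WAL can accumulate between checkpoints
SHOW checkpoint_timeout;   -- maximum time between two checkpoints
SHOW bgwriter_delay;       -- how often the background writer flushes dirty buffers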
>> If the data were written only at the end of the transaction, would
>> the performance be better? In production I want the data to be
>> written to disk regularly to prevent loss of data, but is there any
>> benefit to writing temporary data to disk in the middle of a
>> transaction ???
>>
>>
> See above. Actual disk IO is handled by the server. PostgreSQL is good
> at the "D" in ACID. If your transaction completes, the data has been
> written to disk. Guaranteed.
>
> Cheers,
> Steve
>
>
What I mean is that in "normal" use (not when I run this maintenance
script) I want to be sure that the insert/update requests are written to
disk. They are small (1, 2 or 3 rows affected) but there are a lot of
them, coming from many users. However, just for this maintenance script,
perhaps I can do some other tweaking to reduce the I/O stress during the
transaction ?!

Cheers,

Loic
