Need to update all my 60 million rows at once without transactional integrity

From: christian_behrens(at)gmx(dot)net
To: pgsql-general(at)postgresql(dot)org
Subject: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-20 22:19:34
Message-ID: 20080420221934.13270@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi!

How can I make a Update of a column in a very large table for all rows without using the double amount of disc space and without any need for atomic operation?

I have a very large table with about 60 million rows. I sometimes need to do a simple update to ALL rows that resets a status-flag to zero.

I don't need to have transactional integrity (but of course if the system crashes, there should be no data corruption. A separate flag in the file system can well save the fact that that bulk update was in progress) for this, I don't care or want a abort or "all or nothing" like SQL mandates. The idea is basically that either this update succeeds or it succeeds or - there is no "not". It must succeed. It must be tried until it works. If the update is halfway finished, that's okay.

If I just do an
UPDATE table SET flag=0;
then Pg will make a copy of every row which must be cleaned up by vaccuum. I understand - and don't like during this specific problem - that PG is a MVCC database and will keep my old versions of the rows before the update alive. This is very bad.

If I do a batched loop like this:
UPDATE table SET flag=0 where id>=0 and id <200;
UPDATE table SET flag=0 where id>=200 and id <400;
UPDATE table SET flag=0 where id>=400 and id <600;
...

then PG will seek all over my harddrive I think.

It would be much better if it could just start in the beginning of the table and work it's way towards the end. But which sort-criteria would satisfy this? I don't think that there is any SQL that does something like that.

Another ideas (and I think it's a quite good idea) would be to
drop the column and recreate it with a new default value.

But the problem is that this is not actually MY database, but an appliance (which has a harddrive that does not have the double amount of space available btw) and it has to work reliably whenever something special happens.

And I don't think I should create new columns (the old one would be hidden and their internal column ids lost I think) all the time, that might have a limit.

Can I do it maybe every day??

Is there any other way to go?

I would really like to kind of "skip transactions". Of course basic data integrity in on disc structures, but not atomicy for this bulk update!

I read that PG has an internal command language that is used to build up a database when all the tables for e.g. table-names are not present yet.

Could I use that to hack my way around transactions?

Basically I can do everything to this PG installation, this is an extremly controlled, locked down environment. I don't need to care for anyone else, it's always the same schema, the same version, same OS, etc. and I could happily apply any hack that solves the problem.

Many thanks for any pointers or ideas how to solve this problem!
Christian
--
Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten
Browser-Versionen downloaden: http://www.gmx.net/de/go/browser

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2008-04-21 01:25:38 Re: Changed Hosts, Lots of Errors in PostgreSQL - Help Please!
Previous Message Dawid Kuroczko 2008-04-20 21:56:15 Re: In the belly of the beast (MySQLCon)