Re: Need to update all my 60 million rows at once without transactional integrity

From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 22:36:49
Message-ID: DC4DF14B-E45B-4E61-87DD-0353FDDB2DC6@solfertje.student.utwente.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Apr 21, 2008, at 12:19 AM, christian_behrens(at)gmx(dot)net wrote:

> Hi!
>
> How can I make a Update of a column in a very large table for all
> rows without using the double amount of disc space and without any
> need for atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.

I'll assume they're wide rows then.

> I don't need to have transactional integrity (but of course if the
> system crashes, there should be no data corruption. A separate flag
> in the file system can well save the fact that that bulk update was
> in progress) for this, I don't care or want a abort or "all or
> nothing" like SQL mandates. The idea is basically that either this
> update succeeds or it succeeds or - there is no "not". It must
> succeed. It must be tried until it works. If the update is halfway
> finished, that's okay.
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row which must be cleaned up by
> vaccuum. I understand - and don't like during this specific problem
> - that PG is a MVCC database and will keep my old versions of the
> rows before the update alive. This is very bad.

I'm not sure what you're trying to do exactly, but updating a flag on
every row you changed will double your data size, as update means an
insert/delete combination in MVCC. If there was no reason to update
the row except for changing the flag, that's probably not what you want.

You could instead use referential integrity to do this job for you.
Create a (temporary) table have a foreign key to your records (define
it as both PRIMARY key and FOREIGN key to keep a 1 to 1 relation).
Setting the flag is done by inserting a record into the temp table
referencing the record you "updated". Existence of the record in the
temp table would signify the row was changed.

This way you'll only have inserts of small rows (into the temp
table), and less than 60 million if you didn't succeed.

That still leaves the problem of disabling atomicity of that update
of course. I think you actually mean to also update data in the rows
you want to set those flags for, in which case my suggestion is of
little help...

> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> ...
>
> then PG will seek all over my harddrive I think.
>
> It would be much better if it could just start in the beginning of
> the table and work it's way towards the end. But which sort-
> criteria would satisfy this? I don't think that there is any SQL
> that does something like that.

If that table is clustered over an index, that would be a good
candidate as clustering means the data is mostly ordered on disk
according to that index.

> Another ideas (and I think it's a quite good idea) would be to
> drop the column and recreate it with a new default value.
>
> But the problem is that this is not actually MY database, but an
> appliance (which has a harddrive that does not have the double
> amount of space available btw) and it has to work reliably whenever
> something special happens.
>
> And I don't think I should create new columns (the old one would be
> hidden and their internal column ids lost I think) all the time,
> that might have a limit.
>
> Can I do it maybe every day??
>
>
> Is there any other way to go?
>
> I would really like to kind of "skip transactions". Of course basic
> data integrity in on disc structures, but not atomicy for this bulk
> update!
>
> I read that PG has an internal command language that is used to
> build up a database when all the tables for e.g. table-names are
> not present yet.
>
> Could I use that to hack my way around transactions?
>
> Basically I can do everything to this PG installation, this is an
> extremly controlled, locked down environment. I don't need to care
> for anyone else, it's always the same schema, the same version,
> same OS, etc. and I could happily apply any hack that solves the
> problem.
>
> Many thanks for any pointers or ideas how to solve this problem!
> Christian
> --
> Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten
> Browser-Versionen downloaden: http://www.gmx.net/de/go/browser
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,480fba09927662091310159!

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Martin Gainty 2008-04-23 22:38:06 Re: Stored procedures in C
Previous Message Chris Browne 2008-04-23 22:19:53 Re: Need to update all my 60 million rows at once without transactional integrity