Re: Need to update all my 60 million rows at once without transactional integrity

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Chris Browne" <cbbrowne(at)acm(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-24 00:02:30
Message-ID: D425483C2C5C9F49B5B7A41F8944154701000DEE@postal.corporate.connx.com
Lists: pgsql-general

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
> owner(at)postgresql(dot)org] On Behalf Of Chris Browne
> Sent: Wednesday, April 23, 2008 3:20 PM
> To: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Need to update all my 60 million rows at once
> without transactional integrity
>
> christian_behrens(at)gmx(dot)net writes:
> > How can I make an UPDATE of a column in a very large table, for all
> > rows, without using double the amount of disk space and without any
> > need for an atomic operation?
>
> You may need to redefine the problem.
>
> > I have a very large table with about 60 million rows. I sometimes
> > need to do a simple update to ALL rows that resets a status-flag to
> > zero.
>
> That seems like a pretty useless update to me...
>
> Why not instead redefine the "status-we-need-to-have" to be zero?
> [e.g. - don't try to turn off the Sun; instead, define "brightness" as
> the standard ;-)]
>
> > I don't need transactional integrity for this (but of course if the
> > system crashes, there should be no data corruption; a separate flag
> > in the file system can well record the fact that the bulk update was
> > in progress). I don't care about or want an abort or the "all or
> > nothing" that SQL mandates. The idea is basically that either this
> > update succeeds or it succeeds - there is no "not". It must succeed.
> > It must be tried until it works. If the update is halfway finished,
> > that's okay.
> >
> > If I just do an
> > UPDATE table SET flag=0;
> > then Pg will make a copy of every row which must be cleaned up by
> > vacuum. I understand - and don't like, for this specific problem -
> > that PG is an MVCC database and will keep my old versions of the
> > rows before the update alive. This is very bad.
> >
> > If I do a batched loop like this:
> > UPDATE table SET flag=0 where id>=0 and id <200;
> > UPDATE table SET flag=0 where id>=200 and id <400;
> > UPDATE table SET flag=0 where id>=400 and id <600;
> > ...
> >
> > then PG will seek all over my hard drive, I think.
>
> If I *were* going to do this, I'd want to do:
>
> update table set flag=0 where flag <> 0 and id in
>   (select id from table where flag <> 0 limit 50000);
>
> which will go through the table 50K rows at a time.
>
> > It would be much better if it could just start at the beginning of
> > the table and work its way towards the end. But which sort criterion
> > would satisfy this? I don't think there is any SQL that does
> > something like that.
>
> You couldn't guarantee physical ordering anyways, so that seems
> futile.
>
> > Another idea (and I think it's quite a good one) would be to
> > drop the column and recreate it with a new default value.
>
> The "column drop" would be virtually instantaneous; it would
> essentially be hidden from view.
>
> But the addition of the column would rewrite the table, doubling its
> size :-(.
>
> > And I don't think I should create new columns all the time (the old
> > ones would be hidden and their internal column IDs lost, I think);
> > that might have a limit.
>
> That's correct. It will eventually cause a problem.
>
> > Is there any other way to go?
>
> Step back to what value you're putting into that column, and why.
>
> Perhaps you never need to change this value.
>
> In the Slony-I replication system, we have something sort of similar;
> we collect a table of entries that need to get applied to another
> database. (Table called sl_log_1.) We *NEVER* alter its contents;
> that is not necessary to indicate that data has been replicated (which
> is the status of interest).
>
> Instead of updating tuples to mark that they are processed, we instead
> store information in another table that indicates up to which point in
> time we have finished replicating. (It's more complex and indirect
> than that, but nonetheless, it's still a fair characterization...)
>
> So, perhaps you should be storing a per-day value in the table, and
> store, somewhere else, what point you're "done up to."
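
That "done up to" approach might look something like this - just an
illustration with made-up table and column names, not Slony's actual
schema:

  -- One watermark row instead of a per-row status flag.
  CREATE TABLE work_status (
      done_through date NOT NULL
  );
  INSERT INTO work_status VALUES ('2008-04-22');

  -- "Still to do" is simply everything past the watermark:
  SELECT b.*
    FROM big_table b, work_status s
   WHERE b.load_date > s.done_through;

  -- Finishing a day's work is a one-row update, not 60 million:
  UPDATE work_status SET done_through = '2008-04-23';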

I have a notion here...

If there is a table which gets some subset of its columns completely
rewritten every day, then I suggest taking the completely rewritten
columns and making them into a child table. Take the primary key of the
current problem table and clone that primary key for the child table.

E.g.:

Table foo has columns:
  a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
We update p, t, y every day, for 100% of the rows.
The primary key of foo is (a, b).

I suggest creating bar (a, b, p, t, y), removing p, t, y from the main
table foo, and renaming table foo to foo_1.

Create a view that has the same name as foo and the columns of the
original foo table by joining tables foo_1 and bar.
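
A rough sketch of that split, using the example names above (untested;
the column types here are only placeholders for whatever foo really
uses):

  -- Move the daily-rewritten columns into their own table.
  CREATE TABLE bar (
      a integer NOT NULL,
      b integer NOT NULL,
      p integer,
      t integer,
      y integer,
      PRIMARY KEY (a, b)
  );
  INSERT INTO bar (a, b, p, t, y) SELECT a, b, p, t, y FROM foo;

  -- Drop the moved columns and rename the original table.
  ALTER TABLE foo DROP COLUMN p, DROP COLUMN t, DROP COLUMN y;
  ALTER TABLE foo RENAME TO foo_1;

  -- Recreate "foo" as a view over the join.  List the columns
  -- explicitly instead of using f.* if the original column order
  -- matters to your applications.
  CREATE VIEW foo AS
  SELECT f.*, b.p, b.t, b.y
    FROM foo_1 f
    JOIN bar b ON b.a = f.a AND b.b = f.b;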

Every time you repopulate the data, drop and recreate bar, COPY the new
data in, and then recreate bar's index on (a, b).

Your applications will work the same because the view foo has the same
properties as the original table foo.

You could even populate the bar table offline: rename the old bar table
to bar_old and rename the newly populated table from bar_new to bar, so
even the apparent downtime would be very small.
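
Roughly like this (again only a sketch; the COPY path is hypothetical,
and note that the view has to be re-pointed at the new table, since it
follows bar by OID rather than by name):

  -- Build the replacement offline.
  CREATE TABLE bar_new (LIKE bar);
  COPY bar_new FROM '/path/to/daily_load.csv' WITH CSV;  -- server-side path;
                                                         -- use psql's \copy for a client file
  ALTER TABLE bar_new ADD PRIMARY KEY (a, b);            -- index built after the load

  -- Swap the tables in one short transaction.
  BEGIN;
  ALTER TABLE bar RENAME TO bar_old;
  ALTER TABLE bar_new RENAME TO bar;
  -- The view still points at bar_old, so re-issue its definition
  -- against the new bar before dropping the old table:
  CREATE OR REPLACE VIEW foo AS
  SELECT f.*, b.p, b.t, b.y
    FROM foo_1 f
    JOIN bar b ON b.a = f.a AND b.b = f.b;
  COMMIT;
  DROP TABLE bar_old;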
