Re: Massive table (500M rows) update nightmare

From: "Carlo Stonebanks" <stonec(dot)register(at)sympatico(dot)ca>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Massive table (500M rows) update nightmare
Date: 2010-01-08 17:38:46
Message-ID: hi7qho$2v37$1@news.hub.org
Lists: pgsql-performance

>I thought that post mentioned that the plan
> was one statement in an iteration, and that the cache would have
> been primed by a previous query checking whether there were any rows
> to update. If that was the case, it might be worthwhile to look at
> the entire flow of an iteration.

This is the only SQL query in the code in question - the rest of the code
manages the looping and commits. The code was copied to PgAdminIII and values
were written in for the WHERE clause. To validate that rows would have been
updated, I first had to run a SELECT with the same WHERE clause in PgAdminIII
to see how many rows would have qualified. But this was for testing purposes
only; the SELECT statement does not exist in the code. The vast majority of
the rows that are processed will be updated, as this is a backfill to sync
the old rows with the values now being inserted into new columns.
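For what it's worth, the per-iteration statement is roughly of this shape (a sketch only - the table, columns, and range values here are placeholders, not the actual schema):

```sql
-- Sketch of one batch iteration (all names hypothetical):
UPDATE big_table t
   SET new_col_a = s.val_a,
       new_col_b = s.val_b
  FROM staging s
 WHERE t.id = s.id
   AND t.id BETWEEN 1000000 AND 1100000;  -- range filled in per batch
-- the surrounding code COMMITs here, advances the range, and repeats
```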

> Also, if you ever responded with version and configuration
> information, I missed it.

This is hosted on a new server the client set up, so I am waiting for the
exact OS and hardware config. The PG version is PostgreSQL 8.3.6, compiled by
Visual C++ build 1400; the OS appears to be Windows Server 2003 x64.

More than anything, I am concerned with the long-term use of the system.
This particular challenge with the 500M-row update is one thing, but I am
concerned about the exceptional effort required to do it. Is it REALLY so
exceptional to want to update 500M rows of data in this day and age? Or is
the fact that we are considering dumping and restoring, dropping indexes,
etc., to get it done an early warning that we don't have a solution that is
scaled to the problem?
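The rebuild-instead-of-update approach we are weighing would look roughly like this (a sketch under assumptions - the table names, join, and SELECT list are all hypothetical):

```sql
-- Rewrite the table instead of updating in place (all names hypothetical):
BEGIN;
CREATE TABLE mytable_new AS
    SELECT t.*, s.new_val        -- old rows plus the backfilled column
      FROM mytable t
      JOIN source s ON s.id = t.id;
-- recreate indexes and constraints on mytable_new here
DROP TABLE mytable;
ALTER TABLE mytable_new RENAME TO mytable;
COMMIT;
```

The appeal is that writing a fresh table avoids bloating the old one with 500M dead tuples; the cost is the index rebuilds and the exclusive lock during the swap.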

Config data follows (I am assuming the commented-out values, which I did not
include, are defaulted).

Carlo

autovacuum = on
autovacuum_analyze_scale_factor = 0.1
autovacuum_analyze_threshold = 250
autovacuum_naptime = 1min
autovacuum_vacuum_scale_factor = 0.2
autovacuum_vacuum_threshold = 500
bgwriter_lru_maxpages = 100
checkpoint_segments = 64
checkpoint_warning = 290
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
log_destination = 'stderr'
log_line_prefix = '%t '
logging_collector = on
maintenance_work_mem = 16MB
max_connections = 200
max_fsm_pages = 204800
max_locks_per_transaction = 128
port = 5432
shared_buffers = 500MB
vacuum_cost_delay = 100
work_mem = 512MB
