bulk load performance question

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: bulk load performance question
Date: 2010-10-12 21:16:11
Message-ID: AANLkTikC4KWPAo5w3kB8a62Hc=yR=ikJ3vn52qCSD9Ub@mail.gmail.com
Lists: pgsql-performance

I've got a few tables that periodically get entirely refreshed via a COPY. I
don't really have a good mechanism for detecting only the rows that have
changed, so even though the differences are small, a full copy is easiest.
However, the data includes a primary key column, so I can't simply load
into the existing table and then drop the older rows. Instead, we load into a
table with a different name and then, within a transaction, drop the old table
and rename the new one. The trouble is that although the transaction causes any
query against that table to block until the transaction commits, once it
commits the blocked query fails with an error like:

ERROR:  could not open relation with OID 17556

Is there some way to do the drop+rename in a manner which will preserve the
OID or otherwise allow blocked queries to execute correctly once they
unblock?
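
One thing I have not actually tried, which might sidestep the OID problem
entirely, is truncating and reloading the table in place inside a single
transaction, since the table then keeps its OID (and its grants). A rough,
untested sketch - the file path is just a placeholder:

BEGIN;
TRUNCATE mytable;                       -- table keeps its OID; readers block on the lock
COPY mytable FROM '/path/to/data.csv';  -- placeholder path
COMMIT;

The usual caveat I'm aware of is that TRUNCATE isn't MVCC-safe for
transactions holding older snapshots, so I'm not sure it's the right answer
here.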

A secondary issue is that if permissions were granted to a role on the old
table, the new table does not acquire those permissions and they must be
granted again.
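
For the record, the re-grant itself is trivial - something like the
following, where the role name is just an example - but it's easy to forget:

GRANT SELECT ON mytable TO reporting_role;  -- hypothetical role; real grants vary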

The biggest table that gets updated like this is a couple hundred thousand
rows, with maybe a few thousand rows actually changing or being added with
each load. Suggestions for alternative mechanisms for doing the loading are
welcome. I'd really rather avoid updating every row in a several hundred
thousand row table, especially without easy upsert functionality. The data
is small enough that selecting everything and then comparing in memory
before updating modified rows is doable, but sure seems like a lot of work
if it can be avoided.
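
For reference, this is roughly what I imagine a diff-based load would look
like on 8.4 (no upsert), using the freshly loaded table as a staging area.
It's an untested sketch; the primary key "id" and data column "val" are just
placeholders for the real columns:

BEGIN;

-- remove rows that disappeared from the source
DELETE FROM mytable m
WHERE NOT EXISTS (SELECT 1 FROM mytable_temp t WHERE t.id = m.id);

-- update rows whose contents changed
UPDATE mytable m
SET val = t.val
FROM mytable_temp t
WHERE t.id = m.id
  AND t.val IS DISTINCT FROM m.val;

-- insert brand new rows
INSERT INTO mytable (id, val)
SELECT t.id, t.val
FROM mytable_temp t
WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m.id = t.id);

COMMIT;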

Writing this caused me to think of a possible solution, which appears to
work correctly, but I'd like to confirm it with folks in the know:

Instead of this:

CREATE TABLE mytable_temp...;
COPY mytable_temp FROM ...;
BEGIN;
DROP TABLE mytable;
ALTER TABLE mytable_temp RENAME TO mytable;
COMMIT;

That sequence causes any overlapping queries to pick up the OID of the
soon-to-be-dropped mytable and then fail when the transaction commits.
Instead, I tested this:

COPY mytable_temp FROM ...;
BEGIN;
ALTER TABLE mytable RENAME TO mytable_old;
ALTER TABLE mytable_temp RENAME TO mytable;
COMMIT;
DROP TABLE mytable_old;

It would appear that any query against mytable that overlaps with the
transaction picks up the OID of the original mytable and then blocks until
the transaction commits. When the transaction commits, those queries run
successfully against the original OID (no queries write to this table except
the bulk load) and complete, at which point the drop of mytable_old can
finally proceed. Meanwhile, any queries that don't overlap (or perhaps any
that start after mytable_temp has been renamed to mytable) complete
successfully against the new table.

The net result appears to be that I will no longer suffer the missing-OID
error, which periodically seemed to completely hose a db connection: the
connection had to be closed, since no subsequent queries would ever succeed,
whether they touched the table in question or not. I've only seen that
erroneous behaviour on 8.3 (so far - we only recently upgraded to 8.4.4), but
it was fairly mysterious because I've never been able to replicate it in
testing. I could get a single missing-OID error, but never one that would
break all subsequent queries.

Are my assumptions about this correct?
