integrity check and visibility was: COPY equivalent for updates

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-sql(at)postgresql(dot)org
Cc: Decibel! <decibel(at)decibel(dot)org>
Subject: integrity check and visibility was: COPY equivalent for updates
Date: 2008-07-16 09:06:11
Message-ID: 20080716110611.3337db10@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Tue, 15 Jul 2008 22:13:42 -0500
Decibel! <decibel(at)decibel(dot)org> wrote:

> Depending on what you're doing, it might well be fastest to...

> BEGIN;
> COPY temp_table FROM 'file';
> DELETE FROM real_table WHERE real_table_id IN (SELECT
> real_table_id FROM temp_table);
> INSERT INTO real_table SELECT * FROM temp_table;
> COMMIT;

Thank you for making me reconsider this approach.
I gave up since I have a lot of ON DELETE CASCADE... but you made me
remind that there should be a way to delay the checks at the end of
the transaction (I still couldn't find the syntax on the manual).

Actually from my understanding an UPDATE in PostgreSQL is a DELETE +
INSERT. But wouldn't the subquery
DELETE FROM real_table WHERE real_table_id IN (SELECT real_table_id
FROM temp_table);

take longer to execute than the UPDATE version:
update t1 set col1=temp_t1.col1, col2=temp_t1.col2
where t1.id=temp_t1.id;
mutatis mutandis?

I'd say that pg should be enough smart to take advantage of a unique
key on real_table.real_table_id and stop searching as soon as it
deleted the first occurrence of an ID found in real_table from
temp_table. Is it?
Would adding an unique index on temp_table make the process faster
or slower? I know that temp_table.real_table_id will be unique so
I'm not interested in the integrity check (that anyway should take
place later). At first glance the unique index on
real_table.real_table_id may be useful, the one on temp_table should
not and will actually make the process slower adding index creation
time.

So the "ON DELETE CASCADE/SET NULL" problem should be solved once I
find the syntax to delay checks
Can it be solved this way? I'm going to check as soon as I find the
syntax on the manual.

What about visibility of the "going to be deleted rows"?

If another transaction start between the DELETE and the INSERT
statement are executed... what is it going to see?
What if I don't have just a single INSERT statement since some
columns will be moved to another table?

Does this make necessary the UPDATE approach in spite of the
DELETE/INSERT approach?
If I had temp_table split into 2 tables eg.

create temp_table(
id int,
col11 varchar(32),
col21 varchar(32)
);
create table table1 (
id int primary key,
col11 varchar(32)
);
create table table2 (
id int references table1(id) on delete cascade,
col21 varchar(32)
);

I wouldn't be concerned that data in table1 and table2 contains data
updated in different times and "not coherent" rather I'd be
concerned there is a row in table1 but there isn't in table2.
Furthermore... if I delay checks and ON DELETE triggers rows in
table2 won't be deleted when I delete rows in table1 unless I do it
"manually".

At this point I'm more concerned of maintainability over speed.
Actually if I follow the UPDATE approach and temp_table is still
split across several tables I'll have to write an update for each
tableN anyway... so there shouldn't be too much difference between
the UPDATE and the DELETE/INSERT approach.

What about the effect of a longer transaction compared to a list of
updates on speed?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2008-07-16 14:29:30 Re: Problem with ORDER BY and DISTINCT ON
Previous Message Steve Midgley 2008-07-16 07:39:47 Problem with ORDER BY and DISTINCT ON