Re: Long Running Update - My Solution

From: Harry Mantheakis <harry(dot)mantheakis(at)riskcontrollimited(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com, shortcutter(at)googlemail(dot)com, tv(at)fuzzy(dot)cz
Subject: Re: Long Running Update - My Solution
Date: 2011-06-28 09:48:57
Message-ID: 4E09A389.80604@riskcontrollimited.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello Kevin

> If you use EXPLAIN with both statements...

Yes, the plans are indeed very different.

Here is the statement, set to update up to 100,000 records, which took
about 5 seconds to complete:

UPDATE
table_A
SET
field_1 = table_B.field_1
, field_2 = table_B.field_2
FROM
table_B
WHERE
table_B.tb_id >= 0
AND
table_B.tb_id <= 100000
AND
table_B.tb_id = table_A.ta_id
;

The query plan for the above is:

Nested Loop (cost=0.00..2127044.47 rows=73620 width=63)
-> Index Scan using table_B_pkey on table_B (cost=0.00..151830.75
rows=73620 width=20)
Index Cond: ((tb_id >= 0) AND (tb_id <= 100000))
-> Index Scan using table_A_pkey on table_A (cost=0.00..26.82
rows=1 width=47)
Index Cond: (table_A.ta_id = table_B.tb_id)

Now, if I change the first AND clause to update 1M records, as follows:

table_B.id <= 1000000

I get the following - quite different - query plan:

Hash Join (cost=537057.49..8041177.88 rows=852150 width=63)
Hash Cond: (table_A.ta_id = table_B.tb_id)
-> Seq Scan on table_A (cost=0.00..3294347.71 rows=145561171 width=47)
-> Hash (cost=521411.62..521411.62 rows=852150 width=20)
-> Bitmap Heap Scan on table_B (cost=22454.78..521411.62
rows=852150 width=20)
Recheck Cond: ((tb_id >= 0) AND (tb_id <= 1000000))
-> Bitmap Index Scan on table_B_pkey
(cost=0.00..22241.74 rows=852150 width=0)
Index Cond: ((tb_id >= 0) AND (tb_id <= 1000000))

Note: When I tried updating 1M records, the command was still running
after 25 minutes before I killed it.

The sequential scan in the later plan looks expensive, and (I think)
supports what others have since mentioned, namely that when the
optimizer moves to using sequential scans (working off the disk) things
get a lot slower.

For me, the penny has finally dropped on why I should use EXPLAIN for
bulk operations.

Thanks too, to Greg Smith, Robert Klemme and Thomas for all the feedback.

Kind regards

Harry Mantheakis
London, UK

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Craig McIlwee 2011-06-28 21:28:51 Slow performance when querying millions of rows
Previous Message Greg Smith 2011-06-28 00:37:08 Re: Long Running Update - My Solution