Re: Performance on update from join

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jean-Luc Lachance <jllachan(at)nsd(dot)ca>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Performance on update from join
Date: 2002-05-10 17:44:17
Message-ID: 29566.1021052657@sss.pgh.pa.us
Lists: pgsql-admin pgsql-sql

Jean-Luc Lachance <jllachan(at)nsd(dot)ca> writes:
> Try:

> create table a (a1 int primary key, a2 int, a3 int, a4 int);
> create table b (b1 int, b2 int, b3 int, b4 int, primary key (b1, b2));
> create table d (d1 int, d2 int, d3 int, d4 int, primary key (d1, d2));
> explain update a set a4 = d.d2 from b, d
>   where a.a2 = b.b1 and a.a3 = b.b2
>     and b.b3 = d.d1 and b.b4 = d.d2
>     and a.a4 >= d.d3 and a.a4 <= d.d4;

> Which is closer to what I have.

> +-----------------------------------
> | /\
> A1 A2 A3 A4 B1 B2 B3 B4 D1 D2 D3 D4
> | | | | | | | |
> +--------------+ | +---------+ |
> | | | |
> +--------------+ +---------+

Well, I still don't see any tendency to want to form a "cartesian
product", if by that you mean an unconstrained join. I tried these
cases:

After creating tables as above:

Hash Join  (cost=129.77..162.27 rows=1 width=54)
  Hash Cond: ("outer".a2 = "inner".b1)
  Join Filter: (("outer".a3 = "inner".b2) AND ("outer".a4 >= "inner".d3) AND ("outer".a4 <= "inner".d4))
  ->  Seq Scan on a  (cost=0.00..20.00 rows=1000 width=22)
  ->  Hash  (cost=129.70..129.70 rows=25 width=32)
        ->  Merge Join  (cost=69.83..129.70 rows=25 width=32)
              Merge Cond: (("outer".d1 = "inner".b3) AND ("outer".d2 = "inner".b4))
              ->  Index Scan using d_pkey on d  (cost=0.00..52.00 rows=1000 width=16)
              ->  Sort  (cost=69.83..72.33 rows=1000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20.00 rows=1000 width=16)

(I'm using current development sources here because of the nicer EXPLAIN
display, but the planner behavior hasn't changed from 7.2 AFAIR.)

This doesn't tell us very much because the default assumptions for the
table sizes are all the same, 1000 rows/10 pages. I tried goosing it
up:

test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'a';
UPDATE 1
test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'b';
UPDATE 1
test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'd';
UPDATE 1
test=# explain ...

Merge Join  (cost=8294766.62..20924766.62 rows=69444444 width=54)
  Merge Cond: (("outer".b2 = "inner".a3) AND ("outer".b1 = "inner".a2))
  Join Filter: (("inner".a4 >= "outer".d3) AND ("inner".a4 <= "outer".d4))
  ->  Sort  (cost=8093208.78..8155708.78 rows=25000000 width=32)
        ->  Merge Join  (cost=373805.69..758805.69 rows=25000000 width=32)
              Merge Cond: (("outer".b4 = "inner".d2) AND ("outer".b3 = "inner".d1))
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20000.00 rows=1000000 width=16)
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on d  (cost=0.00..20000.00 rows=1000000 width=16)
  ->  Sort  (cost=201557.84..204057.84 rows=1000000 width=22)
        ->  Seq Scan on a  (cost=0.00..20000.00 rows=1000000 width=22)

Or if we make table a smaller than the other two:

test=# update pg_class set reltuples=100000,relpages=1000 where relname = 'a';
UPDATE 1
test=# explain ...

Merge Join  (cost=981132.44..2248632.44 rows=6944444 width=54)
  Merge Cond: (("outer".d2 = "inner".b4) AND ("outer".d1 = "inner".b3))
  Join Filter: (("inner".a4 >= "outer".d3) AND ("inner".a4 <= "outer".d4))
  ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
        ->  Seq Scan on d  (cost=0.00..20000.00 rows=1000000 width=16)
  ->  Sort  (cost=794229.60..800479.60 rows=2500000 width=38)
        ->  Merge Join  (cost=198580.89..241580.89 rows=2500000 width=38)
              Merge Cond: (("outer".a3 = "inner".b2) AND ("outer".a2 = "inner".b1))
              ->  Sort  (cost=11678.05..11928.05 rows=100000 width=22)
                    ->  Seq Scan on a  (cost=0.00..2000.00 rows=100000 width=22)
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20000.00 rows=1000000 width=16)

The only thing I see here that is unhappy-making is that the planner
does not recognize that joining both columns of a 2-column primary key
guarantees a unique result (for instance, the join of a and b here
should be estimated at 100000 rows, not 2500000, since there can't be
more than one b row joined to any a row). It would do better if it had
any ANALYZE statistics to work with, but ideally it ought to be able to
deduce that even without stats, just from the presence of a unique
index. It will do so for single-column unique indexes, but I haven't
yet figured out a reasonable (read: inexpensive) way to make the
corresponding deduction for the multi-column case.
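Incidentally, those row estimates can be reproduced by hand. With no
ANALYZE stats the planner falls back on hard-wired default selectivities
(0.005 per equality join clause, 1/3 per inequality clause -- these
correspond to DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL in the backend's
selectivity code; treat the exact constants in this sketch as an
assumption, not gospel). A join's estimated size is just the cartesian
size scaled by the product of the clause selectivities:

```sql
-- Back-of-the-envelope check of the no-stats estimates shown above:
SELECT 1000000::numeric * 1000000 * 0.005 * 0.005 AS bd_rows,  -- 25000000
       100000::numeric * 1000000 * 0.005 * 0.005  AS ab_rows,  -- 2500000
       round(1000000::numeric * 25000000 * 0.005 * 0.005 / 9)
                                                  AS top_rows; -- 69444444
```

which matches rows=25000000, rows=2500000, and rows=69444444 in the
plans; the /9 comes from the two range clauses on a4 at 1/3 each.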

In any case, toy examples like this aren't going to prove much
about the behavior with real tables and real statistics ...
if you want constructive answers you're going to need to show
us your actual EXPLAIN ANALYZE results and the pg_stats entries
for the tables.
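(pg_stats is the standard per-column statistics view; a query along
these lines, with the column list trimmed to the most planner-relevant
fields, would show what the planner has to work with:

```sql
-- Statistics the planner uses for selectivity estimation;
-- rows appear only after the tables have been ANALYZEd.
SELECT tablename, attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('a', 'b', 'd')
ORDER BY tablename, attname;
```

Posting most_common_vals and histogram_bounds too wouldn't hurt.)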

regards, tom lane
