Re: Slow query with backwards index scan

From: Tilmann Singer <tils-pgsql(at)tils(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow query with backwards index scan
Date: 2007-07-28 12:52:36
Message-ID: 20070728125236.GD19392@tils.net
Lists: pgsql-performance

* Nis Jørgensen <nis(at)superlativ(dot)dk> [20070727 20:31]:
> How does the "obvious" UNION query do - ie:
>
> SELECT * FROM (
> SELECT * FROM large_table lt
> WHERE lt.user_id = 12345
>
> UNION
>
> SELECT * FROM large_table lt
> WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=12345)
> ) q
>
> ORDER BY created_at DESC LIMIT 10;

Great for the user with little data:

testdb=# EXPLAIN ANALYZE SELECT * FROM (
SELECT * FROM large_table lt
WHERE lt.user_id = 12345
UNION
SELECT * FROM large_table lt
WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=12345)
) q
ORDER BY created_at DESC LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=14673.77..14673.80 rows=10 width=3140) (actual time=133.877..133.946 rows=10 loops=1)
   ->  Sort  (cost=14673.77..14682.22 rows=3378 width=3140) (actual time=133.870..133.894 rows=10 loops=1)
         Sort Key: q.created_at
         ->  Unique  (cost=14315.34..14442.01 rows=3378 width=622) (actual time=133.344..133.705 rows=38 loops=1)
               ->  Sort  (cost=14315.34..14323.78 rows=3378 width=622) (actual time=133.337..133.432 rows=38 loops=1)
                     Sort Key: id, user_id, plaze_id, device, started_at, updated_at, status, "type", duration, permission, created_at, mac_address, subnet, msc
                     ->  Append  (cost=0.00..14117.35 rows=3378 width=622) (actual time=39.144..133.143 rows=38 loops=1)
                           ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..7243.59 rows=2158 width=622) (actual time=39.138..109.831 rows=34 loops=1)
                                 Index Cond: (user_id = 12345)
                           ->  Nested Loop  (cost=42.78..6839.98 rows=1220 width=622) (actual time=14.859..23.112 rows=4 loops=1)
                                 ->  HashAggregate  (cost=42.78..42.79 rows=1 width=4) (actual time=8.092..8.095 rows=1 loops=1)
                                       ->  Bitmap Heap Scan on relationships  (cost=4.34..42.75 rows=11 width=4) (actual time=8.067..8.070 rows=1 loops=1)
                                             Recheck Cond: (user_id = 12345)
                                             ->  Bitmap Index Scan on relationships_user_id_index  (cost=0.00..4.34 rows=11 width=0) (actual time=8.057..8.057 rows=1 loops=1)
                                                   Index Cond: (user_id = 12345)
                                 ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6768.48 rows=2297 width=622) (actual time=6.751..14.970 rows=4 loops=1)
                                       Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 134.220 ms

Not so great for the user with many early matches:

testdb=# EXPLAIN ANALYZE SELECT * FROM (
SELECT * FROM large_table lt
WHERE lt.user_id = 55555
UNION
SELECT * FROM large_table lt
WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=55555)
) q
ORDER BY created_at DESC LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=14673.77..14673.80 rows=10 width=3140) (actual time=3326.304..3326.367 rows=10 loops=1)
   ->  Sort  (cost=14673.77..14682.22 rows=3378 width=3140) (actual time=3326.297..3326.318 rows=10 loops=1)
         Sort Key: q.created_at
         ->  Unique  (cost=14315.34..14442.01 rows=3378 width=622) (actual time=2413.070..3019.385 rows=69757 loops=1)
               ->  Sort  (cost=14315.34..14323.78 rows=3378 width=622) (actual time=2413.062..2590.354 rows=69757 loops=1)
                     Sort Key: id, user_id, plaze_id, device, started_at, updated_at, status, "type", duration, permission, created_at, mac_address, subnet, msc
                     ->  Append  (cost=0.00..14117.35 rows=3378 width=622) (actual time=0.067..1911.626 rows=69757 loops=1)
                           ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..7243.59 rows=2158 width=622) (actual time=0.062..3.440 rows=739 loops=1)
                                 Index Cond: (user_id = 55555)
                           ->  Nested Loop  (cost=42.78..6839.98 rows=1220 width=622) (actual time=0.451..1557.901 rows=69018 loops=1)
                                 ->  HashAggregate  (cost=42.78..42.79 rows=1 width=4) (actual time=0.404..0.580 rows=40 loops=1)
                                       ->  Bitmap Heap Scan on relationships  (cost=4.34..42.75 rows=11 width=4) (actual time=0.075..0.241 rows=40 loops=1)
                                             Recheck Cond: (user_id = 55555)
                                             ->  Bitmap Index Scan on relationships_user_id_index  (cost=0.00..4.34 rows=11 width=0) (actual time=0.053..0.053 rows=40 loops=1)
                                                   Index Cond: (user_id = 55555)
                                 ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6768.48 rows=2297 width=622) (actual time=0.048..28.033 rows=1725 loops=40)
                                       Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 3327.744 ms
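
The difference is easy to see in the plans: the UNION has to append,
sort and uniquify every matching row (69757 of them here) before the
outer ORDER BY ... LIMIT 10 applies, so the per-contact index scans
always run to completion. Just a sketch of a variant that might avoid
that - assuming the two branches can never return the same row (see
the constraint discussion below), so UNION ALL is safe, and that
pushing the LIMIT into each branch lets the scans stop early:

SELECT * FROM (
  (SELECT * FROM large_table lt
   WHERE lt.user_id = 55555
   ORDER BY created_at DESC LIMIT 10)
  UNION ALL
  (SELECT * FROM large_table lt
   WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=55555)
   ORDER BY created_at DESC LIMIT 10)
) q
ORDER BY created_at DESC LIMIT 10;

The overall top 10 can only come from the top 10 of each branch, so
this should return the same rows; whether it is actually faster
depends on the second branch having a cheap way to produce its newest
rows.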

> How about
>
> SELECT * FROM large_table lt
> WHERE lt.user_id = 12345 OR user_id IN (SELECT contact_id FROM
> relationships WHERE user_id=12345)
> ORDER BY created_at DESC LIMIT 10;

Not good for the one with few matches:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
WHERE lt.user_id = 12345 OR user_id IN (SELECT contact_id FROM
relationships WHERE user_id=12345)
ORDER BY created_at DESC LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=42.78..47.05 rows=10 width=622) (actual time=38360.090..62008.336 rows=10 loops=1)
   ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=42.78..935924.84 rows=2192498 width=622) (actual time=38360.084..62008.269 rows=10 loops=1)
         Filter: ((user_id = 12345) OR (hashed subplan))
         SubPlan
           ->  Bitmap Heap Scan on relationships  (cost=4.34..42.75 rows=11 width=4) (actual time=0.031..0.034 rows=1 loops=1)
                 Recheck Cond: (user_id = 12345)
                 ->  Bitmap Index Scan on relationships_user_id_index  (cost=0.00..4.34 rows=11 width=0) (actual time=0.020..0.020 rows=1 loops=1)
                       Index Cond: (user_id = 12345)
 Total runtime: 62008.500 ms

Good for the one with many early matches:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
WHERE lt.user_id = 55555 OR user_id IN (SELECT contact_id FROM
relationships WHERE user_id=55555)
ORDER BY created_at DESC LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=42.78..47.05 rows=10 width=622) (actual time=0.473..0.572 rows=10 loops=1)
   ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=42.78..935924.84 rows=2192498 width=622) (actual time=0.467..0.512 rows=10 loops=1)
         Filter: ((user_id = 55555) OR (hashed subplan))
         SubPlan
           ->  Bitmap Heap Scan on relationships  (cost=4.34..42.75 rows=11 width=4) (actual time=0.070..0.264 rows=40 loops=1)
                 Recheck Cond: (user_id = 55555)
                 ->  Bitmap Index Scan on relationships_user_id_index  (cost=0.00..4.34 rows=11 width=0) (actual time=0.047..0.047 rows=40 loops=1)
                       Index Cond: (user_id = 55555)
 Total runtime: 0.710 ms
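
Both runs of this query get the same plan; the difference is just how
far the backward scan of large_created_at_index has to walk before it
finds 10 rows that pass the filter. For user 55555 the newest rows
match almost immediately; for user 12345 only a few dozen rows match
at all (the UNION plan above found 38), so the scan wades through a
huge part of the index before filling the limit. An index that orders
each user's rows by created_at would at least make a per-user top-N
scan possible - a sketch, with a made-up name, assuming no such index
exists yet:

CREATE INDEX large_user_id_created_at_index
    ON large_table (user_id, created_at);

That alone won't change this OR plan, but together with the
per-branch LIMIT variant sketched above it could let the first branch
read its 10 newest rows straight off the index.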

> I am missing a unique constraint on (user_id, contact_id) - otherwise
> the subselect is not equivalent to the join.
>
> It might be good to know whether contact_id = user_id is possible -
> since this would rule out the possibility of a row satisfying both
> branches of the union.

Thanks for the hint - this is a Rails project where the validations
started out in the application only, so sometimes we forget to also
enforce data integrity in the database. There were in fact a couple
of duplicate user_id/contact_id pairs, and a couple of rows with
user_id=contact_id, although neither should be allowed.

I added a unique index on (user_id, contact_id) and a check
constraint to prevent user_id=contact_id, vacuumed, and ran the four
query variants above again - but that neither changed the plans nor
made any substantial difference in execution time.
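
For the record, what I added was along these lines (the names here
are made up):

-- reject duplicate contact pairs
CREATE UNIQUE INDEX relationships_user_id_contact_id_index
    ON relationships (user_id, contact_id);

-- reject self-relationships
ALTER TABLE relationships
    ADD CONSTRAINT relationships_no_self_contact
    CHECK (user_id <> contact_id);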

> Probably you also should have foreign key constraints on
> relationships.user_id and relationships.contact_id. These are unlikely
> to affect performance though, in my experience.

They are there - I just removed them from the schema output before
posting, assuming (as you note) that they are not relevant for join
performance.

Til
