Re: Subquery flattening causing sequential scan

From: Jim Crate <jimcfl(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Subquery flattening causing sequential scan
Date: 2011-12-28 17:22:49
Message-ID: 4BCFFC7E-795D-4886-ABB0-D7628E583999@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Dec 27, 2011, at 1:12 PM, Tom Lane wrote:

> Jim Crate <jimcfl(at)gmail(dot)com> writes:
>> My question is why does it do a seq scan when it flattens this
>> subquery into a JOIN?
>
> Because it thinks there will be 3783 rows out of the msg scan, which if
> true would make your desired nestloop join a serious loser. You need to
> see about getting that estimate to be off by less than three orders of
> magnitude. Possibly raising the stats target on emsg_messages would
> help. I'd also try converting the inner NOT IN into a NOT EXISTS, just
> to see if that makes the estimate any better.

The planner does choose the nested loop after converting the NOT IN to NOT EXISTS. Using LEFT JOIN / IS NULL also generated the same plan as NOT EXISTS. I guess I really need to learn more about reading explain plans, and expand my use of different constructs. It's so easy to fall into the trap of using the same construct in all situations just because it works well enough most of the time and is easy to read.

As for default_statistics_target, I read the docs and I'm not sure how increasing that value would help in this case. There are only a couple hundred accounts, and less than 5 values for message_type and spam_level. In the emsg_message_folders table, the message_id is considered unique (pg_stats has n_distinct = -1), which would also be correct.

EXPLAIN ANALYZE
SELECT ema.message_id, ema.email_address_id, ema.address_type
FROM emsg_message_addresses ema
WHERE ema.message_id IN (
SELECT id
FROM emsg_messages msg
WHERE msg.account_id = 314 AND msg.outgoing = FALSE
AND msg.message_type = 1 AND msg.spam_level < 2
AND msg.deleted_at IS NULL
AND NOT EXISTS (
SELECT emf.message_id
FROM emsg_message_folders emf
WHERE emf.account_id = 314 AND emf.message_id = msg.id
)
)

QUERY PLAN
Nested Loop (cost=84785.80..84806.43 rows=100455 width=12) (actual time=262.507..262.528 rows=6 loops=1)
-> HashAggregate (cost=84785.80..84785.81 rows=1 width=4) (actual time=262.445..262.446 rows=3 loops=1)
-> Hash Anti Join (cost=8285.87..84785.80 rows=1 width=4) (actual time=254.363..262.426 rows=3 loops=1)
Hash Cond: (msg.id = emf.message_id)
-> Bitmap Heap Scan on emsg_messages msg (cost=869.66..77274.56 rows=7602 width=4) (actual time=13.622..204.879 rows=12387 loops=1)
Recheck Cond: (account_id = 314)
Filter: ((NOT outgoing) AND (deleted_at IS NULL) AND (spam_level < 2) AND (message_type = 1))
-> Bitmap Index Scan on index_emsg_messages_on_account_id (cost=0.00..867.76 rows=34582 width=0) (actual time=8.756..8.756 rows=35091 loops=1)
Index Cond: (account_id = 314)
-> Hash (cost=6990.69..6990.69 rows=34042 width=4) (actual time=45.785..45.785 rows=34647 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 1219kB
-> Bitmap Heap Scan on emsg_message_folders emf (cost=680.16..6990.69 rows=34042 width=4) (actual time=5.465..35.842 rows=34647 loops=1)
Recheck Cond: (account_id = 314)
-> Bitmap Index Scan on index_emsg_message_folders_on_account_id (cost=0.00..671.65 rows=34042 width=0) (actual time=4.966..4.966 rows=34647 loops=1)
Index Cond: (account_id = 314)
-> Index Scan using index_emsg_message_addresses_on_message_id on emsg_message_addresses ema (cost=0.00..20.45 rows=13 width=12) (actual time=0.023..0.023 rows=2 loops=3)
Index Cond: (ema.message_id = msg.id)
Total runtime: 262.742 ms

Jim Crate

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Matteo Beccati 2011-12-28 18:02:36 Re: PostgreSQL 9.0.4 blocking in lseek?
Previous Message Matteo Beccati 2011-12-28 14:20:07 Re: PostgreSQL 9.0.4 blocking in lseek?