Performance Anomaly with "col in (A, B)" vs. "col = A OR col = B" ver. 9.0.3

From: Timothy Garnett <tgarnett(at)panjiva(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Performance Anomaly with "col in (A, B)" vs. "col = A OR col = B" ver. 9.0.3
Date: 2011-09-23 22:37:22
Message-ID: CAPcyiQ1U1k=mcpKT2ecy02Vr=2Eqpee3tnQLpfvW-aViSp=Big@mail.gmail.com
Lists: pgsql-performance

Hi All,

We are currently using PostgreSQL 9.0.3 and noticed a performance anomaly
in a framework-generated (ActiveRecord) query against one of our tables.
The query uses an IN clause to check an indexed column for the presence of
either of two values. In this particular case neither value is present
(though in other cases one or more might be). The framework generates a
LIMIT 1 query to test for existence. This query ends up using a seq scan
and is quite slow; however, rewriting it with OR = rather than IN uses the
index (as does removing the limit or raising it to a large value). The
table has 36 million rows (more details are below) and is read-only in
typical usage. I was wondering whether IN vs. OR being planned so
differently represents a bug, and/or whether we might have some
misconfiguration somewhere that leads the query planner to pick a plan
that in the best case can only be slightly faster than using the index,
but in the worst case is much, much slower. I would also think the
clustering on the table argues against a sequential scan for this kind of
query: since the hts_code_id's are colocated, if the id is present,
performance will vary greatly depending on what order the seq scan walks
the table (which we've observed); if the id(s) are not present, this plan
is always terrible. We can use SET enable_seqscan TO off around this query
if need be, but it seems like something the planner should have handled
better, unless we have something weird somewhere (conf file details are
below).
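
For reference, a minimal sketch of that enable_seqscan workaround (using
SET LOCAL so the setting only lasts for the enclosing transaction; we have
not settled on this yet):

BEGIN;
SET LOCAL enable_seqscan TO off;
SELECT "exp_detls".id FROM "exp_detls"
WHERE ("exp_detls"."hts_code_id" IN (12,654)) LIMIT 1;
COMMIT;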

psql (9.0.3)
Type "help" for help.

-- Table info
dev=> ANALYZE exp_detls;
ANALYZE
dev=> select count(*) from exp_detls;
36034391
dev=>explain analyze select count(*) from exp_detls;
 Aggregate  (cost=1336141.30..1336141.31 rows=1 width=0) (actual time=43067.620..43067.621 rows=1 loops=1)
   ->  Seq Scan on exp_detls  (cost=0.00..1246046.84 rows=36037784 width=0) (actual time=0.011..23703.177 rows=36034391 loops=1)
 Total runtime: 43067.675 ms
dev=>select pg_size_pretty(pg_table_size('exp_detls'));
6919 MB

-- Problematic Query
dev=> explain analyze SELECT "exp_detls".id FROM "exp_detls" WHERE
("exp_detls"."hts_code_id" IN (12,654)) LIMIT 1;
 Limit  (cost=0.00..158.18 rows=1 width=4) (actual time=9661.363..9661.363 rows=0 loops=1)
   ->  Seq Scan on exp_detls  (cost=0.00..1336181.90 rows=8447 width=4) (actual time=9661.360..9661.360 rows=0 loops=1)
         Filter: (hts_code_id = ANY ('{12,654}'::integer[]))
 Total runtime: 9661.398 ms
(4 rows)

-- Using OR =: much faster, though a more complicated plan than the one below
dev=> explain analyze SELECT "exp_detls".id FROM "exp_detls" WHERE
("exp_detls"."hts_code_id" = 12 OR "exp_detls"."hts_code_id" = 654) LIMIT 1;
 Limit  (cost=162.59..166.29 rows=1 width=4) (actual time=0.029..0.029 rows=0 loops=1)
   ->  Bitmap Heap Scan on exp_detls  (cost=162.59..31188.14 rows=8370 width=4) (actual time=0.028..0.028 rows=0 loops=1)
         Recheck Cond: ((hts_code_id = 12) OR (hts_code_id = 654))
         ->  BitmapOr  (cost=162.59..162.59 rows=8370 width=0) (actual time=0.027..0.027 rows=0 loops=1)
               ->  Bitmap Index Scan on index_exp_detls_on_hts_code_id_and_data_month  (cost=0.00..79.20 rows=4185 width=0) (actual time=0.017..0.017 rows=0 loops=1)
                     Index Cond: (hts_code_id = 12)
               ->  Bitmap Index Scan on index_exp_detls_on_hts_code_id_and_data_month  (cost=0.00..79.20 rows=4185 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                     Index Cond: (hts_code_id = 654)
 Total runtime: 0.051 ms
(9 rows)

-- No limit: much faster, and a cleaner-looking plan (though of course
problematic when there are many matching rows)
dev=>explain analyze SELECT "exp_detls".id FROM "exp_detls" WHERE
("exp_detls"."hts_code_id" IN (12,654));
 Bitmap Heap Scan on exp_detls  (cost=156.93..31161.56 rows=8370 width=4) (actual time=0.028..0.028 rows=0 loops=1)
   Recheck Cond: (hts_code_id = ANY ('{12,654}'::integer[]))
   ->  Bitmap Index Scan on index_exp_detls_on_hts_code_id_and_data_month  (cost=0.00..154.84 rows=8370 width=0) (actual time=0.026..0.026 rows=0 loops=1)
         Index Cond: (hts_code_id = ANY ('{12,654}'::integer[]))
 Total runtime: 0.045 ms
(5 rows)
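
-- Possible rewrite (untested sketch, not something we've run): keep the
-- LIMIT 1 existence check, but fence the IN scan inside a subquery with
-- OFFSET 0 so the outer LIMIT can't influence the scan choice inside it
SELECT id FROM (
  SELECT "exp_detls".id FROM "exp_detls"
  WHERE ("exp_detls"."hts_code_id" IN (12,654))
  OFFSET 0
) t LIMIT 1;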

-- Table Schema

Table "public.exp_detls"
Column | Type |
Modifiers
------------------+-----------------------------+--------------------------------------------------------
id | integer | not null default
nextval('exp_detls_id_seq'::regclass)
created_at | timestamp without time zone | not null
df | integer |
hts_code_id | integer | not null
uscb_country_id | integer |
country_id | integer |
uscb_district_id | integer |
cards_mo | numeric(15,0) | not null
qty_1_mo | numeric(15,0) | not null
qty_2_mo | numeric(15,0) |
all_val_mo | numeric(15,0) | not null
air_val_mo | numeric(15,0) | not null
air_wgt_mo | numeric(15,0) | not null
ves_val_mo | numeric(15,0) | not null
ves_wgt_mo | numeric(15,0) | not null
cnt_val_mo | numeric(15,0) | not null
cnt_wgt_mo | numeric(15,0) | not null
cards_yr | numeric(15,0) | not null
qty_1_yr | numeric(15,0) | not null
qty_2_yr | numeric(15,0) |
all_val_yr | numeric(15,0) | not null
air_val_yr | numeric(15,0) | not null
air_wgt_yr | numeric(15,0) | not null
ves_val_yr | numeric(15,0) | not null
ves_wgt_yr | numeric(15,0) | not null
cnt_val_yr | numeric(15,0) | not null
cnt_wgt_yr | numeric(15,0) | not null
data_month | date | not null
parent_id | integer |
Indexes:
"exp_detls_pkey" PRIMARY KEY, btree (id)
"index_exp_detls_on_data_month" btree (data_month) WITH (fillfactor=100)
"index_exp_detls_on_hts_code_id_and_data_month" btree (hts_code_id,
data_month) WITH (fillfactor=100) CLUSTER
"index_exp_detls_on_parent_id" btree (parent_id) WITH (fillfactor=100)
WHERE parent_id IS NOT NULL
<Several FK's>
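
For completeness, the CLUSTER marker on the hts_code_id index above
corresponds to a command along these lines (a sketch; the exact invocation
isn't shown in this post):

CLUSTER exp_detls USING index_exp_detls_on_hts_code_id_and_data_month;
ANALYZE exp_detls;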

postgresql.conf non-default settings
listen_addresses = '*' # what IP address(es) to listen on;
port = 5432 # (change requires restart)
max_connections = 230 # (change requires restart)
tcp_keepalives_idle = 180 # TCP_KEEPIDLE, in seconds;
shared_buffers = 4GB # min 128kB, DEFAULT 32MB
work_mem = 512MB # min 64kB, DEFAULT 1MB
maintenance_work_mem = 256MB # min 1MB, DEFAULT 16MB
effective_io_concurrency = 2 # 1-1000. 0 disables prefetching
synchronous_commit = off # immediate fsync at commit, DEFAULT on
wal_buffers = 16MB # min 32kB, DEFAULT 64kB
wal_writer_delay = 330ms # 1-10000 milliseconds, DEFAULT 200ms
checkpoint_segments = 24 # in logfile segments, min 1, 16MB each
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
effective_cache_size = 24GB # DEFAULT 128MB
logging_collector = on # Enable capturing of stderr and csvlog
log_checkpoints = on # DEFAULT off
log_connections = on # DEFAULT off
log_disconnections = on # DEFAULT off
log_hostname = on # DEFAULT off
log_line_prefix = '%t' # special values:
track_activity_query_size = 8192 # (change requires restart)
bytea_output = 'escape' # hex, escape, Default hex
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
