Seq Scan vs. Index Scan

From: Nassib Nassar <nassar(at)renci(dot)org>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: Seq Scan vs. Index Scan
Date: 2011-08-04 13:40:08
Message-ID: BD2648C0-836B-4F4C-834A-4B26CF092B20@renci.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

In this example it looks to me like the planner is choosing a Seq Scan resulting in 18x running time compared to running it with enable_seqscan = 'off'. Adding more indexes to public.gene (please see below) seemed to make things worse. I definitely have run VACUUM ANALYZE on everything, manually. What am I missing? Thank you for any feedback.

Query:
SELECT *
FROM gene_af_polyphen
WHERE dataset_id = '001-1' AND
(vartype = 'snp' OR
vartype = 'ins' OR
vartype = 'del' OR
vartype = 'sub');

Query plan:
http://explain.depesz.com/s/qnZ

Query plan after SET enable_seqscan TO 'off':
http://explain.depesz.com/s/N5q

Hardware:
24GB memory / 8 core running Linux 2.6.32 x86_64

Database configuration:
version | PostgreSQL 9.0.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit
autovacuum | off
default_transaction_isolation | serializable
effective_cache_size | 18GB
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
listen_addresses | *
max_connections | 100
max_stack_depth | 2MB
server_encoding | UTF8
shared_buffers | 6GB
TimeZone | US/Eastern
work_mem | 1GB

Table row counts:
public.gene ~1 billion (may have lots of NULLs in several columns)
public.af 38878319
public.polyphen 25821

Database:

Table "public.gene"
Column | Type | Modifiers
-----------------------+------------------------+-----------
dataset_id | character varying(255) |
referencename | character varying(255) |
index | integer |
locus | integer |
haplotype | integer |
chromosome | character varying(255) |
begincoord | integer |
endcoord | integer |
vartype | character varying(255) |
reference | character varying(255) |
call | character varying(255) |
xref | text |
geneid | integer |
mrnaacc | character varying(255) |
proteinacc | character varying(255) |
symbol | character varying(255) |
orientation | character(1) |
exoncategory | character varying(255) |
exon | integer |
codingregionknown | character(1) |
aacategory | character varying(255) |
nucleotidepos | character varying(255) |
proteinpos | character varying(255) |
aaannot | character varying(255) |
aacall | character varying(255) |
aaref | character varying(255) |
allele | character varying(255) |
component | character varying(255) |
componentindex | character varying(255) |
impact | character varying(255) |
annotationrefsequence | character varying(255) |
samplesequence | character varying(255) |
genomerefsequence | character varying(255) |
pfam | character varying(255) |
unknown1 | character varying(255) |
Indexes:
"gene_dataset_id_idx" btree (dataset_id), tablespace "indexspace"

Table "public.af"
Column | Type | Modifiers
-------------+------------------------+-----------
chromosome | character varying(255) | not null
endcoord | integer | not null
rs_id | character varying(255) |
reference | character varying(255) | not null
call | character varying(255) | not null
allele_freq | numeric |
Indexes:
"af_allele_freq_idx" btree (allele_freq), tablespace "indexspace"
"af_call_idx" btree (call), tablespace "indexspace"
"af_chromosome_idx" btree (chromosome), tablespace "indexspace"
"af_endcoord_idx" btree (endcoord), tablespace "indexspace"
"af_reference_idx" btree (reference), tablespace "indexspace"

Table "public.polyphen"
Column | Type | Modifiers
-------------------------+------------------------+-----------
mrnaacc | character varying(255) | not null
proteinpos | character varying(255) | not null
annotationrefsequence | character varying(255) | not null
samplesequence | character varying(255) | not null
prediction | character varying(255) |
probability_deleterious | numeric |
Indexes:
"polyphen_annotationrefsequence_idx1" btree (annotationrefsequence), tablespace "indexspace"
"polyphen_mrnaacc_idx1" btree (mrnaacc), tablespace "indexspace"
"polyphen_proteinpos_idx1" btree (proteinpos), tablespace "indexspace"
"polyphen_samplesequence_idx1" btree (samplesequence), tablespace "indexspace"

CREATE VIEW gene_af_polyphen AS
SELECT gene.dataset_id dataset_id,
gene.referencename referencename,
gene.index "index",
gene.locus locus,
gene.haplotype haplotype,
gene.chromosome chromosome,
gene.begincoord begincoord,
gene.endcoord endcoord,
gene.vartype vartype,
gene.reference reference,
gene.call call,
gene.xref xref,
gene.geneid geneid,
gene.mrnaacc mrnaacc,
gene.proteinacc proteinacc,
gene.symbol symbol,
gene.orientation orientation,
gene.exoncategory exoncategory,
gene.exon exon,
gene.codingregionknown codingregionknown,
gene.aacategory aacategory,
gene.nucleotidepos nucleotidepos,
gene.proteinpos proteinpos,
gene.aaannot aaannot,
gene.aacall aacall,
gene.aaref aaref,
gene.allele allele,
gene.component component,
gene.componentindex componentindex,
gene.impact impact,
gene.annotationrefsequence annotationrefsequence,
gene.samplesequence samplesequence,
gene.genomerefsequence genomerefsequence,
gene.pfam pfam,
gene.unknown1 unknown1,
af.rs_id rs_id,
af.allele_freq allele_freq,
polyphen.prediction prediction,
polyphen.probability_deleterious probability_deleterious
FROM gene
LEFT JOIN af
ON gene.chromosome = af.chromosome AND
gene.endcoord = af.endcoord AND
gene.reference = af.reference AND
gene.call = af.call
LEFT JOIN polyphen
ON gene.mrnaacc = polyphen.mrnaacc AND
gene.proteinpos = polyphen.proteinpos AND
gene.annotationrefsequence = polyphen.annotationrefsequence AND
gene.samplesequence = polyphen.samplesequence;

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Craig Ringer 2011-08-04 13:49:54 Re: Need to tune for Heavy Write
Previous Message Nicholson, Brad (Toronto, ON, CA) 2011-08-04 12:41:47 Re: Need to tune for Heavy Write