Skip site navigation (1) Skip section navigation (2)

Re: Inefficient query plan

From: Jann Röder <roederja(at)ethz(dot)ch>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Inefficient query plan
Date: 2010-08-24 13:24:30
Message-ID: i50h6e$4t1$1@dough.gmane.org (view raw or flat)
Thread:
Lists: pgsql-performance
Thanks everyone,
the problem just solved itself. After the ANALYZE had finished, postgres
started doing what I wanted it to do all along:
EXPLAIN SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE
p.itemID = r.ItemID AND p.issueID = 'A1983PW823';

"Nested Loop  (cost=0.00..4515980.97 rows=2071811 width=16)"
"  ->  Index Scan using idx_papers_issueid on papers p
(cost=0.00..274.53 rows=508 width=16)"
"        Index Cond: (issueid = 'A1983PW823'::bpchar)"
"  ->  Index Scan using paperreferences_pkey on paperreferences r
(cost=0.00..8838.21 rows=4078 width=16)"
"        Index Cond: (r.itemid = p.itemid)"

So thanks again. I'm starting to grasp the postgres quirks :)

Jann

Am 24.08.10 15:03, schrieb Jann Röder:
> So that took a while... I'm currently running ANALYZE on the
> PaperReferences table again (the one where I changed the data type).
> 
> The plan however is still the same:
> "Hash Join  (cost=280.88..24330800.08 rows=670602240 width=16)"
> "  Hash Cond: (r.itemid = p.itemid)"
> "  ->  Seq Scan on paperreferences r  (cost=0.00..15109738.40
> rows=670602240 width=64)"
> "  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
> "        ->  Index Scan using idx_papers_issueid on papers p
> (cost=0.00..274.53 rows=508 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"
> 
> But I can now force it to use an index scan instead of a seqScan:
> "Merge Join  (cost=0.00..2716711476.57 rows=670602240 width=16)"
> "  Merge Cond: (p.itemid = r.itemid)"
> "  ->  Index Scan using papers_pkey on papers p  (cost=0.00..21335008.47
> rows=508 width=16)"
> "        Filter: (issueid = 'A1983PW823'::bpchar)"
> "  ->  Index Scan using paperreferences_pkey on paperreferences r
> (cost=0.00..2686993938.83 rows=670602240 width=64)"
> 
> Unfortunately this is not faster than the other one. I did not wait
> until it returned because I want this query to take less than 5 seconds
> or so.
> 
> Here is my query again:
> SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID =
> r.ItemID AND p.issueID = 'A1983PW823';
> 
> I can also write it as:
> SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM
> Papers WHERE IssueID = 'A1983PW823')
> 
> Which is more what I would do if I was the database. Unfortunately this
> is not fast either:
> 
> "Hash Semi Join  (cost=280.88..24330800.08 rows=670602240 width=64)"
> "  Hash Cond: (paperreferences.itemid = papers.itemid)"
> "  ->  Seq Scan on paperreferences  (cost=0.00..15109738.40
> rows=670602240 width=64)"
> "  ->  Hash  (cost=274.53..274.53 rows=508 width=16)"
> "        ->  Index Scan using idx_papers_issueid on papers
> (cost=0.00..274.53 rows=508 width=16)"
> "              Index Cond: (issueid = 'A1983PW823'::bpchar)"
> 
> The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is
> really fast, though and returns 16 rows. If I unroll the query by hand
> like this:
> SELECT ItemID FROM PaperReferences WHERE
> (ItemID = 'A1983PW82300001' OR
> ItemID = 'A1983PW82300002' OR
> ItemID = 'A1983PW82300003' OR
> ItemID = 'A1983PW82300004' OR
> ItemID = 'A1983PW82300005' OR
> ItemID = 'A1983PW82300006' OR
> ...)
> 
> (All the ORed stuff is the result of the sub-query) I get my result
> really fast. So what I need now is a way to tell postgres to do it that
> way automatically. If everything else fails I will have to put that
> logic into my application in java code, which I don't want to do because
> then I will also have to remove my constraints so I can delete stuff at
> a reasonable speed.
> 
> Thanks,
> Jann
> 
> 
> Am 23.08.10 15:33, schrieb Kevin Grittner:
>> Jann Röder<roederja(at)ethz(dot)ch> wrote:
>>  
>>>                  Table "public.papers"
>>>       Column      |          Type           | Modifiers
>>> ------------------+-------------------------+-----------
>>>  itemid           | character(15)           | not null
>>  
>>> wos-db=> \d PaperReferences
>>>              Table "public.paperreferences"
>>>        Column       |         Type          | Modifiers
>>> --------------------+-----------------------+-----------
>>>  itemid             | character varying(15) | not null
>>  
>>> I just noticed that PaperReferences uses character varying (15)
>>> and Papers uses character(15). Stupid mistake of mine. Do you
>>> think this might cause the bad query planning?
>>  
>> Absolutely.  These are *not* the same type and don't compare all
>> that well.
>>  
>>> I will alter the table to use character(15) in both cases and see
>>> if that helps.
>>  
>> I suspect that making them the same will cure the problem, but I
>> would recommend you make any character(n) columns character
>> varying(n) instead of the other way around.  The the character(n)
>> data type has many surprising behaviors and tends to perform worse. 
>> Avoid using it if possible.
>>  
>>> postgresql.conf:
>>> max_connections = 20			
>>> shared_buffers = 256MB			
>>> work_mem = 10MB				
>>> maintenance_work_mem = 128MB		
>>> max_stack_depth = 4MB			
>>> synchronous_commit = off		
>>> wal_buffers = 1MB			
>>> checkpoint_segments = 10		
>>> effective_cache_size = 768MB
>>> default_statistics_target = 200	
>>> datestyle = 'iso, mdy'
>>> lc_messages = 'C'			
>>> lc_monetary = 'C'			
>>> lc_numeric = 'C'			
>>> lc_time = 'C'				
>>> default_text_search_config = 'pg_catalog.simple'
>>  
>>> Do you need an EXPLAIN ANALYZE output? Since it takes so long I
>>> can't easily post one right now. But maybe I can get one over
>>> night.
>>  
>> Not necessary; you've already identified the cause and the fix.
>>  
>>> My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a
>>> 2.1 GHz (or so) core 2 Duo processor.
>>  
>> OK.  If you still don't get a good plan, you might want to try
>> edging up effective_cache_size, if the sum of your shared_buffers
>> and OS cache is larger than 768MB (which I would expect it might
>> be).  If the active part of your database (the part which is
>> frequently referenced) fits within cache space, or even a
>> significant portion of it fits, you might need to adjust
>> random_page_cost and perhaps seq_page_cost to reflect the lower
>> average cost of fetching from cache rather than disk -- but you want
>> to fix your big problem (the type mismatch) first, and then see if
>> you need further adjustments.
>>  
>> -Kevin



In response to

pgsql-performance by date

Next:From: Guillaume LelargeDate: 2010-08-25 06:58:59
Subject: Re: Are Indices automatically generated for primary keys?
Previous:From: Jann RöderDate: 2010-08-24 13:03:16
Subject: Re: Inefficient query plan

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group