need to repeat the same condition on joined tables in order to choose the proper plan

From: Svetlin Manavski <svetlin(dot)manavski(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: need to repeat the same condition on joined tables in order to choose the proper plan
Date: 2011-06-14 13:55:26
Message-ID: BANLkTikSqgoucZkO0F3v8pHCopXjR=+NYg@mail.gmail.com
Lists: pgsql-performance

Hi performance gurus,

One of the reasons I prefer PostgreSQL is that it does not implement hints.
However, I have run into a situation in which I seem to be forced to use a
hint-like statement:

I have 2 tables in PostgreSQL 9.0:

tcpsessions - about 4 million rows in the lab, hundreds of millions of rows in production
primary key (detectorid, id)

tcpsessiondata - about 2-5 times bigger than tcpsessions
foreign key (detectorid, sessionid) references tcpsessions (detectorid, id)
there is an index on (detectorid, sessionid)

For completeness, tcpsessiondata is actually partitioned as described in the
official documentation, but I will spare you the details unless they are
needed. For the purpose of this message, all the data lives in one child
table: tcpsessiondata_default
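
In case it helps, here is a heavily simplified sketch of the relevant parts of
the schema (the real tables have more columns, and please take the column
types shown here as indicative only):

CREATE TABLE appqosdata.tcpsessions (
    detectorid integer NOT NULL,
    id         bigint  NOT NULL,
    -- ... further per-session columns omitted ...
    PRIMARY KEY (detectorid, id)
);

CREATE TABLE appqosdata.tcpsessiondata (
    detectorid             integer NOT NULL,
    sessionid              bigint  NOT NULL,
    createdtime            bigint  NOT NULL,
    datafromsource         numeric,
    datafromdestination    numeric,
    packetsfromsource      bigint,
    packetsfromdestination bigint,
    FOREIGN KEY (detectorid, sessionid)
        REFERENCES appqosdata.tcpsessions (detectorid, id)
);

-- the child table that currently holds all the data
CREATE TABLE appqosdata.tcpsessiondata_default ()
    INHERITS (appqosdata.tcpsessiondata);

CREATE INDEX idx_tcpsessiondata_default_detectoridandsessionid
    ON appqosdata.tcpsessiondata_default (detectorid, sessionid);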

When I run the following simple query:

select
  (createdtime / 60000000000) as timegroup,
  (sum(datafromsource) + sum(datafromdestination)) as numbytes,
  (sum(packetsfromsource) + sum(packetsfromdestination)) as numpackets
from
  tcpsessiondata SD, tcpsessions SS
where
  SD.detectorid = SS.detectorid
  and SD.sessionid = SS.id
  and SD.detectorid = 1
  and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
group by
  timegroup
order by
  timegroup asc

I get the following plan:
Sort  (cost=259126.13..259126.63 rows=200 width=32) (actual time=32526.762..32526.781 rows=20 loops=1)
  Output: ((sd.createdtime / 60000000000::bigint)), (((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric)), ((sum(sd.packetsfromsource) + sum(sd.packetsfromdestination)))
  Sort Key: ((sd.createdtime / 60000000000::bigint))
  Sort Method: quicksort  Memory: 26kB
  ->  HashAggregate  (cost=259112.49..259118.49 rows=200 width=32) (actual time=32526.657..32526.700 rows=20 loops=1)
        Output: ((sd.createdtime / 60000000000::bigint)), ((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric), (sum(sd.packetsfromsource) + sum(sd.packetsfromdestination))
        ->  Hash Join  (cost=126553.43..252603.29 rows=520736 width=32) (actual time=22400.430..31291.838 rows=570100 loops=1)
              Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, (sd.createdtime / 60000000000::bigint)
              Hash Cond: (sd.sessionid = ss.id)
              ->  Append  (cost=0.00..100246.89 rows=520736 width=42) (actual time=2382.160..6226.906 rows=570100 loops=1)
                    ->  Seq Scan on appqosdata.tcpsessiondata sd  (cost=0.00..18.65 rows=1 width=42) (actual time=0.002..0.002 rows=0 loops=1)
                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid
                          Filter: ((sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint) AND (sd.detectorid = 1))
                    ->  Bitmap Heap Scan on appqosdata.tcpsessiondata_default sd  (cost=11001.37..100228.24 rows=520735 width=42) (actual time=2382.154..5278.319 rows=570100 loops=1)
                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid
                          Recheck Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))
                          ->  Bitmap Index Scan on idx_tcpsessiondata_default_detectoridandsessionid  (cost=0.00..10871.19 rows=520735 width=0) (actual time=2351.865..2351.865 rows=574663 loops=1)
                                Index Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))
              ->  Hash  (cost=72340.48..72340.48 rows=3628356 width=10) (actual time=19878.891..19878.891 rows=3632586 loops=1)
                    Output: ss.detectorid, ss.id
                    Buckets: 8192  Batches: 64  Memory Usage: 2687kB
                    ->  Seq Scan on appqosdata.tcpsessions ss  (cost=0.00..72340.48 rows=3628356 width=10) (actual time=627.164..14586.202 rows=3632586 loops=1)
                          Output: ss.detectorid, ss.id
                          Filter: (ss.detectorid = 1)
Total runtime: 32543.224 ms

As we can see, the planner chooses an index scan on tcpsessiondata_default (as
expected) but a seq scan on tcpsessions. However, if I add the following ugly,
redundant conditions to my query:

and SS.detectorid = 1
and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8

so that the full query now becomes:

select
  (createdtime / 60000000000) as timegroup,
  (sum(datafromsource) + sum(datafromdestination)) as numbytes,
  (sum(packetsfromsource) + sum(packetsfromdestination)) as numpackets
from
  tcpsessiondata SD, tcpsessions SS
where
  SD.detectorid = SS.detectorid
  and SD.sessionid = SS.id
  and SD.detectorid = 1
  and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
  and SS.detectorid = 1
  and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8
group by
  timegroup
order by
  timegroup asc

then I get an index scan on tcpsessions as well, and the running time drops to
about a third of the previous one:

Sort  (cost=157312.59..157313.09 rows=200 width=32) (actual time=9682.748..9682.764 rows=20 loops=1)
  Output: ((sd.createdtime / 60000000000::bigint)), (((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric)), ((sum(sd.packetsfromsource) + sum(sd.packetsfromdestination)))
  Sort Key: ((sd.createdtime / 60000000000::bigint))
  Sort Method: quicksort  Memory: 26kB
  ->  HashAggregate  (cost=157298.94..157304.94 rows=200 width=32) (actual time=9682.649..9682.692 rows=20 loops=1)
        Output: ((sd.createdtime / 60000000000::bigint)), ((sum(sd.datafromsource) + sum(sd.datafromdestination)) / 1048576::numeric), (sum(sd.packetsfromsource) + sum(sd.packetsfromdestination))
        ->  Hash Join  (cost=32934.67..150744.28 rows=524373 width=32) (actual time=3695.016..8370.629 rows=570100 loops=1)
              Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, (sd.createdtime / 60000000000::bigint)
              Hash Cond: (sd.sessionid = ss.id)
              ->  Append  (cost=0.00..100948.71 rows=524373 width=42) (actual time=2318.568..4799.985 rows=570100 loops=1)
                    ->  Seq Scan on appqosdata.tcpsessiondata sd  (cost=0.00..18.65 rows=1 width=42) (actual time=0.001..0.001 rows=0 loops=1)
                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid
                          Filter: ((sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint) AND (sd.detectorid = 1))
                    ->  Bitmap Heap Scan on appqosdata.tcpsessiondata_default sd  (cost=11080.05..100930.06 rows=524372 width=42) (actual time=2318.563..3789.844 rows=570100 loops=1)
                          Output: sd.createdtime, sd.datafromsource, sd.datafromdestination, sd.packetsfromsource, sd.packetsfromdestination, sd.detectorid, sd.sessionid
                          Recheck Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))
                          ->  Bitmap Index Scan on idx_tcpsessiondata_default_detectoridandsessionid  (cost=0.00..10948.96 rows=524372 width=0) (actual time=2305.322..2305.322 rows=574663 loops=1)
                                Index Cond: ((sd.detectorid = 1) AND (sd.sessionid >= 1001000000000::bigint) AND (sd.sessionid <= 2001000000000::bigint))
              ->  Hash  (cost=30159.60..30159.60 rows=185726 width=10) (actual time=1345.307..1345.307 rows=194971 loops=1)
                    Output: ss.detectorid, ss.id
                    Buckets: 8192  Batches: 4  Memory Usage: 2297kB
                    ->  Bitmap Heap Scan on appqosdata.tcpsessions ss  (cost=3407.46..30159.60 rows=185726 width=10) (actual time=483.572..1069.292 rows=194971 loops=1)
                          Output: ss.detectorid, ss.id
                          Recheck Cond: ((ss.id >= 1001000000000::bigint) AND (ss.id <= 2001000000000::bigint))
                          Filter: (ss.detectorid = 1)
                          ->  Bitmap Index Scan on idx_tcpsessions_id  (cost=0.00..3361.02 rows=201751 width=0) (actual time=451.242..451.242 rows=219103 loops=1)
                                Index Cond: ((ss.id >= 1001000000000::bigint) AND (ss.id <= 2001000000000::bigint))
Total runtime: 9682.905 ms

Let me also add that if I remove the conditions on SD but keep the
conditions on SS, then I get an index scan on tcpsessions BUT a seq scan on
tcpsessiondata.
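
To be explicit, in that third variant the WHERE clause reduces to just the join
conditions plus the filters on SS, i.e. something like:

SD.detectorid = SS.detectorid
and SD.sessionid = SS.id
and SS.detectorid = 1
and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8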

Let's now assume that the index scan on both tables is the best choice, since
the planner itself selects it in one of the 3 cases. (It is also the fastest
plan, as we extract only about 200,000 rows out of 4 million in this example.)
But I am really surprised that the planner needs me to explicitly specify the
same conditions twice, like this:

SD.detectorid = SS.detectorid
and SD.sessionid = SS.id
and SD.detectorid = 1
and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
and SS.detectorid = 1
and SS.id >= 1001000000000::INT8 and SS.id <= 2001000000000::INT8

in order to use the primary key on SS, even though it is absolutely clear that
"SD.detectorid = SS.detectorid and SD.sessionid = SS.id". I hope you agree
that repeating the same conditions on SS looks very much like giving a hint to
use the index there. I feel very uncomfortable writing such an ugly condition,
especially knowing that I am doing it to "force an index". On the other hand,
I am terrified that in production we may end up with a seq scan over hundreds
of millions of rows just to extract 200,000.
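
For what it's worth, I suppose I could at least double-check, as a
session-local diagnostic only (not something I would ship), whether the
planner picks the index plan for the original, unmodified query once seq scans
are discouraged:

BEGIN;
-- debugging aid only: make seq scans unattractive for this transaction
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, VERBOSE)
select
  (createdtime / 60000000000) as timegroup,
  (sum(datafromsource) + sum(datafromdestination)) as numbytes,
  (sum(packetsfromsource) + sum(packetsfromdestination)) as numpackets
from
  tcpsessiondata SD, tcpsessions SS
where
  SD.detectorid = SS.detectorid
  and SD.sessionid = SS.id
  and SD.detectorid = 1
  and SD.sessionid >= 1001000000000::INT8 and SD.sessionid <= 2001000000000::INT8
group by timegroup
order by timegroup asc;

-- SET LOCAL reverts automatically when the transaction ends
ROLLBACK;

But of course that is just another form of hinting, so it does not really
solve my problem.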

Would you please explain this behavior, and how would you suggest I proceed?

Thanks for any comments,
Svetlin Manavski
