From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimization idea
Date: 2010-04-26 09:33:29
Message-ID: x2te94e14cd1004260233i45c44966qc4a8b10f4edb2963@mail.gmail.com
Lists: pgsql-performance
2010/4/26 Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>:
>
>> On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>
>> wrote:
>>
>>>
>>> I don't think this is just an issue with statistics, because the same
>>> problem arises when I try executing a query like this:
>>>
>>
>> I'm not sure how you think this proves that it isn't a problem with
>> statistics, but I think what you should be focusing on here, looking
>> back to your original email, is that the plans that are actually much
>> faster have almost as much estimated cost as the slower one. Since
>> all your data is probably fully cached, at a first cut, I might try
>> setting random_page_cost and seq_page_cost to 0.005 or so, and
>> adjusting effective_cache_size to something appropriate.
>>
>> ...Robert
>>
>>
>
> Ok. I thought it was quite obvious because of these two queries. I can't
> understand why the estimated rows count is 40040 in the first plan.
In the first query, the planner does not use the specific values 2, 3
and 4. It simply bets it will get 2 rows out of t1 (I think it should
say 3, but it doesn't). So it divides the estimated number of rows in
t2 by 5 (the number of distinct values of t2.t) and multiplies by 2
(the estimated t1 rows): 100100 / 5 * 2 = 40040.
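For illustration, that arithmetic can be reproduced directly (this is a
sketch of the estimate, assuming the per-value selectivity is simply
1/n_distinct; the numbers come from the EXPLAIN output below):

```sql
-- Join size estimate = rows(t2) * rows(t1) / n_distinct(t2.t)
-- With rows(t2) = 100100, rows(t1) = 2 and n_distinct = 5:
SELECT 100100 * 2 / 5 AS estimated_join_rows;  -- 40040
```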
In the second query the planner behaves differently: it propagates the
constant value of t1.t into the condition on t2.t for each join, and
so finds a cheaper plan than the one using a seqscan on t2.
We are in a corner-case situation here, where the number of distinct
values (n_distinct) on the column is smaller than the statistics
target, and where we might be able to improve the planner's decision.
I believe I have already read something on this topic on -hackers...
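One way to see what the planner believes about the column is to look at
the pg_stats view (a diagnostic sketch; with n_distinct below the
statistics target, every value should appear in the most-common-values
list, so per-value estimates should be exact if the stats are fresh):

```sql
-- Inspect the planner's view of t2.t: distinct-value estimate and MCVs
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 't2' AND attname = 't';

-- Refresh the statistics if they look stale:
ANALYZE t2;
```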
>
> test=# explain analyze select * from t2 join t1 on t1.t = t2.t where
> t1.t in (2,3,4);
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=1.09..2319.87 rows=40040 width=32) (actual
> time=0.050..356.269 rows=400 loops=1)
> Hash Cond: (t2.t = t1.t)
> -> Seq Scan on t2 (cost=0.00..1543.00 rows=100100 width=16) (actual
> time=0.013..176.087 rows=100100 loops=1)
> -> Hash (cost=1.07..1.07 rows=2 width=16) (actual time=0.023..0.023
> rows=3 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 1kB
> -> Seq Scan on t1 (cost=0.00..1.07 rows=2 width=16) (actual
> time=0.006..0.014 rows=3 loops=1)
> Filter: (t = ANY ('{2,3,4}'::bigint[]))
> Total runtime: 356.971 ms
> (8 rows)
>
> test=# explain analyze select * from t2 join t1 on t1.t = t2.t where
> t1.t = 2 union all select * from t2 join t1 on t1.t = t2.t where t1.t =
> 3 union all select * from t2 join t1 on t1.t = t2.t where t1.t = 4;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
> Append (cost=0.00..112.42 rows=407 width=32) (actual time=0.048..3.487
> rows=400 loops=1)
> -> Nested Loop (cost=0.00..47.51 rows=197 width=32) (actual
> time=0.045..1.061 rows=200 loops=1)
> -> Seq Scan on t1 (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.011..0.014 rows=1 loops=1)
> Filter: (t = 2)
> -> Index Scan using t_idx on t2 (cost=0.00..44.48 rows=197
> width=16) (actual time=0.026..0.382 rows=200 loops=1)
> Index Cond: (pg_temp_2.t2.t = 2)
> -> Nested Loop (cost=0.00..32.67 rows=117 width=32) (actual
> time=0.019..0.599 rows=100 loops=1)
> -> Seq Scan on t1 (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.003..0.006 rows=1 loops=1)
> Filter: (t = 3)
> -> Index Scan using t_idx on t2 (cost=0.00..30.43 rows=117
> width=16) (actual time=0.010..0.211 rows=100 loops=1)
> Index Cond: (pg_temp_2.t2.t = 3)
> -> Nested Loop (cost=0.00..28.17 rows=93 width=32) (actual
> time=0.017..0.534 rows=100 loops=1)
> -> Seq Scan on t1 (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.005..0.008 rows=1 loops=1)
> Filter: (t = 4)
> -> Index Scan using t_idx on t2 (cost=0.00..26.18 rows=93
> width=16) (actual time=0.007..0.187 rows=100 loops=1)
> Index Cond: (pg_temp_2.t2.t = 4)
> Total runtime: 4.190 ms
> (17 rows)
>
>
--
Cédric Villemain