Re: Select performance variation based on the different combinations of using where lower(), order by, and limit

From: Tyler Reese <jukey91(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Select performance variation based on the different combinations of using where lower(), order by, and limit
Date: 2013-08-19 04:12:55
Message-ID: CAFvRLydnqc2kg=KJskzy7VMADDaAgTM9YDqFumpc0nROFCEHLQ@mail.gmail.com
Lists: pgsql-general

On Sun, Aug 18, 2013 at 10:33 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> Tyler Reese <jukey91(at)gmail(dot)com> wrote:
> > Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> >> Tyler Reese <jukey91(at)gmail(dot)com> wrote:
>
> >>> mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
> >>>
> >>>  Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
> >>>    ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
> >>>          Filter: (lower("CallingPartyNumber") = '9725551212'::text)
> >>>  Total runtime: 30465.246 ms
> >>
> >> It thinks that it will only need to read 1/412th of the table to
> >> find 100 matching rows, and using that index it will be able to
> >> skip the sort. Since there aren't 100 matching rows, it has to
> >> read the whole table through the index.
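
(Working through those numbers: an estimate of 41212 matching rows means
the planner expects to collect 100 of them after walking roughly
100 / 41212 ≈ 1/412 of the table in "key" order, and the same ratio shows
up in the costs above, since 72882.05 is about 1/412th of 30036152.32.)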
>
> > So, since it thinks it needs to read 1/412th of the table is the
> > reason why the query planner chooses to use the primary key index
> > instead of the callingpartynumber index, like it does in the
> > first 3 cases?
>
> The optimizer compares the estimated cost of reading all matching
> rows (but only matching rows) out of order and then sorting them to
> the estimated cost of reading them in order and filtering out the
> rows that don't match (and stopping when the limit is reached).
> Since it thought a lot of rows would match, that made the sort look
> more expensive and also made it look like it would not need to read a
> very large percentage of the table.
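
As an aside, one way to see the alternative plan the optimizer was weighing
is a session-local diagnostic like the following; the exact settings are my
assumption rather than anything from this thread:

BEGIN;
SET LOCAL enable_indexscan = off;   -- discourage the ordered scan on cdr_pkey
EXPLAIN ANALYZE
    SELECT * FROM "cdr"
    WHERE lower("CallingPartyNumber") = '9725551212'
    ORDER BY "key" LIMIT 100;
ROLLBACK;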
>
> > I'm curious as to why it says "rows=41212". Is that the estimate
> > of the number of rows that meet the filter condition? Where does
> > that come from?
>
> That's based on the distribution observed in the last random sample
> when ANALYZE was run (as a command or by autovacuum). When there
> is an error in the estimate that bad, either autovacuum is not
> configured to be aggressive enough in analyzing or the sample size
> was not large enough.
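
A quick way to check when that sample was last taken (assuming the table
name from the plans in this thread):

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'cdr';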
>
> >> Raising the statistics target and running ANALYZE might allow
> >> it to use a more accurate estimate, and thereby make a better
> >> choice.
> >
> > I haven't heard of raising the statistics target, so I'll read up
> > on that.
>
>
> http://www.postgresql.org/docs/9.2/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
>
> http://www.postgresql.org/docs/9.2/interactive/sql-altertable.html
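
For the archives, a rough sketch of the per-column route from the second
link, with 1000 only as an example value. Since the filter here is on
lower("CallingPartyNumber"), the statistics that matter are the ones
gathered for the expression index, which (if I understand correctly) follow
default_statistics_target from the first link on this release:

ALTER TABLE "cdr" ALTER COLUMN "CallingPartyNumber" SET STATISTICS 1000;
ANALYZE "cdr";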
>
> > A few days ago, all 4 cases were responding equally fast. I had
> > been messing around with the postgres settings, and I went and
> > dropped all of the indexes and recreated them just to see what
> > would happen. I wouldn't think that recreating the indexes would
> > cause case 4 to go slow, but that's the symptom I am seeing now.
> > Should I be running analyze on a table after it has been
> > reindexed?
>
> Only if you have indexes on expressions rather than simple column
> names. If you have an index on lower("CallingPartyNumber") that
> would qualify as an expression.
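
For reference, the shape of that expression index as implied by the plan
below; the name comes from the Bitmap Index Scan node, and the exact
definition is my guess:

CREATE INDEX callingpartynumber_lower ON "cdr" (lower("CallingPartyNumber"));
ANALYZE "cdr";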
>
> The normal reason to need to get fresh statistics is because of
> changes in the distribution of values in a column, particularly
> after a bulk load. Also, columns with a large number of distinct
> values tend to benefit from a higher statistics target. The down
> sides of higher statistics targets are a longer time to ANALYZE and
> increased planning time; so it's generally best to use the default
> except where a particular problem has been observed, like in this
> case.
>

I performed ANALYZE on the table and now case 4 is fast again:

mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=958.72..958.97 rows=100 width=758) (actual time=0.251..0.310 rows=11 loops=1)
   ->  Sort  (cost=958.72..959.32 rows=240 width=758) (actual time=0.244..0.266 rows=11 loops=1)
         Sort Key: key
         Sort Method: quicksort  Memory: 30kB
         ->  Bitmap Heap Scan on cdr  (cost=7.30..949.55 rows=240 width=758) (actual time=0.105..0.162 rows=11 loops=1)
               Recheck Cond: (lower("CallingPartyNumber") = '9725551212'::text)
               ->  Bitmap Index Scan on callingpartynumber_lower  (cost=0.00..7.24 rows=240 width=0) (actual time=0.085..0.085 rows=11 loops=1)
                     Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 0.517 ms
(9 rows)
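
(For the record, the ANALYZE here was a plain table-level one, i.e.
something along the lines of: ANALYZE "cdr";)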

Thanks for the help, Kevin.
