
Re: Querying 19million records very slowly

From: Paul Ramsey <pramsey(at)refractions(dot)net>
To: Kjell Tore Fossbakk <kjelltore(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying 19million records very slowly
Date: 2005-06-21 19:59:22
Message-ID: 42B8719A.6040809@refractions.net
Lists: pgsql-performance
Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query, 
because it gives both the estimated and actual times/rows for each step 
in the plan.

- The documentation is right: columns with little variation are pretty 
useless to index. Indexing is about "selectivity", reducing the amount 
of stuff the database has to read off the disk.

- You only have two things in your WHERE clause, so that is where the 
most important indexes reside. How many of your rows have p1=53? How 
many of your rows have happened in the last day? If your answer is "a 
lot" then the indexes are not going to help: PostgreSQL will be more 
efficient scanning every tuple than it will be jumping around the index 
structure for a large number of tuples.

- If neither time nor p1 are particularly selective individually, but 
they are selective when taken together, try a multi-key index on them both.
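
For example, combining the last two points (an untested sketch, using the
table and column names from the message quoted below):

-- a composite index on the two columns in the WHERE clause;
-- the index name here is just made up
create index test_p1_time_idx on test (p1, time);
analyze test;

-- then see what the planner actually chooses, and how its row
-- estimates compare with reality
explain analyze
select date_trunc('hour', time), count(*) as total
from test
where p1 = 53 and time > now() - interval '24 hours'
group by date_trunc
order by date_trunc;

If the plan still shows a Seq Scan, compare the estimated and actual row
counts in the output: a huge actual count means the condition simply
matches too many rows for any index to pay off.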

Paul

Kjell Tore Fossbakk wrote:

> Hello!
> 
> I use FreeBSD 4.11 with PostgreSQL 7.3.8.
> 
> I have a huge database with roughly 19 million records. There is just one
> table, with a time field, a few ints and a few strings.
> 
> 
> table test
> fields time (timestamp), source (string), destination (string), p1 (int),
> p2 (int)
> 
> 
> I have run VACUUM ANALYZE ;
> 
> I have created indexes on every field, but for some reason my PostgreSQL
> server wants to use a seqscan, even though I know an indexed scan would be
> much faster.
> 
> 
> create index test_time_idx on test (time) ;
> create index test_source_idx on test (source) ;
> create index test_destination_idx on test (destination) ;
> create index test_p1_idx on test (p1) ;
> create index test_p2_idx on test (p2) ;
> 
> 
> 
> What is really strange is that when I query a count(*) on one of the int
> fields (p1), which has a very low count, PostgreSQL uses a seqscan. In
> another count on the same int field (p1), I know it returns about 2.2
> million hits, but then it suddenly uses a seqscan instead of an indexed
> one. Isn't the whole idea of indexing to increase performance in large
> queries? To make a sort of phonebook for the values, so it is faster to
> look up whatever you need... This just seems backwards.
> 
> Here is an EXPLAIN of my query:
> 
> database=> explain select date_trunc('hour', time),count(*) as total from
> test where p1=53 and time > now() - interval '24 hours' group by
> date_trunc order by date_trunc ;
>                                         QUERY PLAN
> ------------------------------------------------------------------------------------------
> Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
>    ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
>          ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
>                Sort Key: date_trunc('hour'::text, "time")
>                ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083 width=8)
>                      Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
> (6 rows)
> 
> 
> 
> 
> database=> drop INDEX test_<TABULATOR>
> test_source_idx  test_destination_idx  test_p1_idx  test_p2_idx  test_time_idx
> 
> 
> After all this, I tried to set enable_seqscan to off and
> enable_nestloop to on. This didn't help much either. The time to run the
> query is still in minutes. My results are the number of elements for each
> hour, and it gives about 1000-2000 hits per hour. I have read somewhere
> that PostgreSQL can easily handle 100-200 million records and, with a
> properly tuned system, give great performance. I would like to
> learn how :)
> 
> I also found an article on a page
> ( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
> Tip #11:  Don't bother indexing columns with huge numbers of records and a
> small range of values, such as BOOLEAN columns.
> 
> This tip, regretfully, is perhaps the only tip where I cannot provide a
> good, real-world example from my work.  So I'll give you a hypothetical
> situation instead:
> 
> Imagine that you have a database table with a list of every establishment
> vending ice cream in the US.  A simple example might look like:
> 
> Where there were almost 1 million rows, but due to simplistic data entry,
> only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
> 3-OTHER) which are relatively evenly distributed.  In this hypothetical
> situation, you might find (with testing using EXPLAIN) that an index on
> type is ignored and the parser uses a "seq scan" (or table scan) instead.
> This is because a table scan can actually be faster than an index scan in
> this situation.  Thus, any index on type should be dropped.
> 
> Certainly, the boolean column (active) requires no indexing as it has only
> two possible values and no index will be faster than a table scan.
> 
> 
> Then I ask, what is the point of indexing if I can't use it on a VERY
> large database? It is on my 15 million record database that it takes
> forever to do seqscans over and over again... This is probably, as I
> mentioned earlier, the reason (read the quote) why it chooses a full scan
> and not an indexed one...
> 
> So what do I do? :confused:
> 
> I've used SQL for years, but never at such a big scale, so I never had to
> learn how to deal with large numbers of records. Usually it was a maximum
> of 1000 records. Now, with millions, I need to learn a way to make my
> sucky queries better.
> 
> I'm trying to learn more about tuning my system, making better queries and
> such. I've found some documents on the Internet, but they are far from the
> best.
> 
> Feedback most appreciated!
> 
> Regards,
> a learning PostgreSQL user
> 

