Re: selects from large tables

From: Nikk Anderson <Nikk(dot)Anderson(at)parallel(dot)ltd(dot)uk>
To: "'Charles H(dot) Woloszynski'" <chw(at)clearmetrix(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: selects from large tables
Date: 2002-11-20 15:08:11
Message-ID: DA1274E682D3734B8802904A9B36124C298AA1@nic-nts1.nic.parallel.ltd.uk
Lists: pgsql-performance
Hi, 

I tried a test CLUSTER on a copy of our real data - all 10 million rows or
so.  WOW!  Normal select performance improved drastically: selecting 3
months' worth of data used to take 146 seconds to retrieve, and after
clustering it took 7.7 seconds!  We are now looking into ways to automate
clustering to keep the table in order.  The CLUSTER itself took around 2.5
hours.
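
For reference, the command we would be scripting is essentially this (table
and index names are the ones from the plans quoted below; this is the
PostgreSQL 7.x CLUSTER syntax - newer releases spell it
CLUSTER xx USING http_timejobid):

    CLUSTER http_timejobid ON xx;  -- rewrite xx in index order
    ANALYZE xx;                    -- refresh planner statistics afterwards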

As our backend systems write hundreds of rows per minute into the table that
needs clustering - will CLUSTER handle locking the table when it drops the
old copy and renames the clustered one?  What happens to data being added to
the table while CLUSTER is running?  Our backend systems may have problems
if the table does not exist when they try to insert, and we don't want to
lose any data.

Thanks

Nikk


-----Original Message-----
From: Charles H. Woloszynski [mailto:chw(at)clearmetrix(dot)com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] selects from large tables


Nikk:

Are you doing vacuums on these tables?  I was under the impression 
that the estimated row count should be close to the real row count 
returned, and when it is not (as it looks in your case), the primary 
reason for the disconnect is that the statistics for the tables are 
out of date.  
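
If not, a plain (table name as used in your plans):

    VACUUM ANALYZE table_name;

should bring the row estimates back in line with reality.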

Since it used the indexes, I am not sure if the old stats are causing 
any issues, but I suspect they are not helping.  

Also, do you do any clustering of the data (since the queries are mostly 
time limited)?  I am wondering if the system is doing lots of seeks to 
get the data (implying that the data is all over the disk and not 
clustered).  
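
One quick way to check is the planner's correlation statistic - values near
1.0 or -1.0 mean the column is stored in roughly index order (sketch;
substitute your real table name):

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'table_name';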

Charlie

Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephan, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the 
> same as our live system, and the queries are similar to those we 
> actually carry out. 
>
> As the data was synthetic there was a bit of repetition (19 million 
> rows of repetition!!) of the item used in the where clause, meaning 
> that most of the table was returned by the queries - oops!  So what I 
> have done is run some more realistic queries from our live system, and 
> recorded the time each takes along with the explain results.  Just to 
> note that explain's estimated number of rows is way out - its guesses 
> are way too low.
>
> Typically a normal query on our live system returns between 200 and 
> 30000 rows depending on the reports a user wants to generate.  In 
> prior testing, we noted that using SELECT COUNT( ..   was slower than 
> other queries, which is why we thought we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of 
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021' 
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021' 
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175 
> width=57)
> This query returns 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021' 
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832 
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20 
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547 
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo(at)megazone23(dot)bigpanda(dot)com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance(at)postgresql(dot)org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here).  At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
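> Something along these lines (a sketch - names are placeholders, and a
> cursor must run inside a transaction):
>
>     BEGIN;
>     DECLARE big_cur CURSOR FOR
>         SELECT * FROM table_name WHERE time > '2002-10-10';
>     FETCH 1000 FROM big_cur;  -- repeat until it returns no rows
>     CLOSE big_cur;
>     COMMIT;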
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412 
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE:  QUERY PLAN:
> > Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> > rows=19669 width=57)
>

-- 


Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com





