Expected performance of querying 5k records from 4 million records?

From: Anish Kejariwal <anishkej(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Expected performance of querying 5k records from 4 million records?
Date: 2012-06-15 16:17:39
Message-ID: CAOpcnr8S2hmV1z-=ZF1k+t+uWWyw82nUbwhpQmdf=8R_Kpv78g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I've been struggling with this issue for the last several days, and I feel
like I'm running into a few different issues that I don't understand. I'm
using postgres 9.0.8, and here's the OS I'm running this on:
inux 2.6.18-308.4.1.el5xen #1 SMP Tue Apr 17 17:49:15 EDT 2012 x86_64
x86_64 x86_64 GNU/Linux

from show all:
shared_buffers | 4GB
work_mem | 192MB
maintenance_work_mem | 1GB
effective_cache_size | 24GB
wal_sync_method | fdatasync
random_page_cost | 4

My situtation: I have an empty parent table, that has 250 partitions. Each
partition has 4 million records (250 megs). I'm querying 5k records
directly from one partition (no joins), and it's taking ~2 seconds to get
the results. This feels very slow to me for an indexed table of only 4
million records.

Quick overview of my questions::
1. expected performance? tips on what to look into to increase performance?
2. should multicolumn indices help?
3. does reindex table cache the table?

Below are the tables, queries, and execution plans with my questions with
more detail. (Since I have 250 partitions, I can query one partition after
the other to ensure that I'm not pulling results form the cache)

Parent table:
# \d data
Table "public.data"
Column | Type | Modifiers
--------------+------------------+-----------
data_id | integer | not null
dataset_id | integer | not null
stat | double precision | not null
stat_id | integer | not null
Number of child tables: 254 (Use \d+ to list them.)

Child (partition) with ~4 million records:

\d data_part_201
genepool_1_11=# \d data_part_201
Table "public.data_part_201"
Column | Type | Modifiers
--------------+------------------+-----------
data_id | integer | not null
dataset_id | integer | not null
stat | double precision | not null
stat_id | integer | not null
Indexes:
"data_unq_201" UNIQUE, btree (data_id)
"data_part_201_dataset_id_idx" btree (dataset_id)
"data_part_201_stat_id_idx" btree (stat_id)
Check constraints:
"data_chk_201" CHECK (dataset_id = 201)
Inherits: data

explain analyze select data_id, dataset_id, stat from data_part_201 where
dataset_id = 201
and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

Bitmap Heap Scan on data_part_201 (cost=115.79..14230.69 rows=4383
width=16) (actual time=36.103..1718.141 rows=5350 loops=1)
Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
Filter: ((dataset_id = 201) AND (stat_id = 6))
-> Bitmap Index Scan on data_unq_201 (cost=0.00..114.70 rows=5403
width=0) (actual time=26.756..26.756 rows=5350 loops=1)
Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
Total runtime: 1728.447 ms
(6 rows)

Time: 1743.535 ms

QUESTION 1: you can see that the query is very simple. is this the optimal
execution plan? any tips on what to look into to increase performance?

I then tried adding the following multi-column index:
"data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id,
stat_id)

The query now takes 27 seconds!:
explain analyze select data_id, dataset_id, stat from data_part_202 where
dataset_id = 202
and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

Index Scan using data_part_202_dataset_regionset_data_idx on data_part_202
(cost=0.00..7987.83 rows=4750 width=16) (actual time=39.152..27339.401
rows=5350 loops=1)
Index Cond: ((dataset_id = 202) AND (data_id >= 50544630) AND (data_id
<= 50549979) AND (stat_id = 6))
Total runtime: 27349.091 ms
(3 rows)

QUESTION 2: why is a multicolumn index causing the query to run so much
slower? I had expected it to increase the performance

QUESTION 3:
If I do the following: reindex table data_part_204 the query now takes
50-70 milliseconds. Is this because the table is getting cached? How do I
know if a particular query is coming from the cache? The reason why I think
"reindex table" is caching the results, is that select count(*) from the
partition also causes the query to be fast.

(and yes, vacuum analyze on the partition makes no difference)

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2012-06-15 17:44:45 Re: Expected performance of querying 5k records from 4 million records?
Previous Message Siddharth Shah 2012-06-14 15:15:58 High CPU Usage