Re: Large table performance

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mark Dobbrow" <mdobbrow(at)yahoo(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Large table performance
Date: 2007-01-14 19:20:35
Message-ID: C1CFC283.18064%llonergan@greenplum.com
Lists: pgsql-performance

Mark,

Note that selecting an indexed column still means that Postgres fetches the
whole rows from disk. I think your performance problem is either: 1) a slow
disk, or 2) index access of widely distributed data. If it's (1), there are
plenty of references from this list on how to check for that and fix it. If
it's (2), see below.
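
One quick way to tell the two apart is to time the query cold and then warm
(a sketch using the table and column names from your mail; 42 just stands in
for one of your teacher ids):

-- run once with cold caches, then again; if the warm run is fast and the
-- cold run spends nearly all its time in the index scan node, you are
-- seek-bound on disk
EXPLAIN ANALYZE
SELECT grades FROM large_table WHERE teacher_id = 42;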

The performance of index-accessed data in Postgres depends on how the data
is loaded into the DBMS. If the records you are fetching are distributed
widely among the 3M records on disk, then the select is going to "hop, skip
and jump" across the disk to get the records you need. If those records are
stored more closely together, then the fetching from disk is going to be
sequential. A good example of the best situation for an index is an index
on a date column when the data is loaded sequentially by date. A query
against a specific date range will result in an ordered fetch from the disk,
which leverages fast sequential access.
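
You can check how well the table's physical order matches the index order
with the correlation statistic Postgres keeps (assuming the table has been
ANALYZEd recently; a value near 1.0 or -1.0 means well-ordered, near 0 means
scattered):

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'large_table'
  AND attname   = 'teacher_id';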

The difference in performance between ordered and distributed access is
similar to the difference between "random seek" and "sequential" performance
of the disk subsystem. The random seek performance of a typical disk
subsystem with one thread (for one user in Postgres) is about 120 seeks per
second. If your data were randomly distributed, you'd expect about
10,000/120 = 83 seconds to gather these records. Since you're getting 10
seconds, I expect that your data is lumped into groups and you are getting a
mix of sequential reads and seeks.
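
Spelling that estimate out (same assumed ~120 seeks/sec figure):

   fully random:  10,000 rows / 120 seeks/sec ~= 83 seconds
   observed:      ~10 seconds ~= 1,200 seeks' worth of time
   implied:       only about 1 row in 8 paid for its own seek; the rest
                  came along in sequential reads of neighboring blocks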

Note that adding more disks into a RAID does not help the random seek
performance within Postgres, but may linearly improve the ordered access
speed. So even with 44 disks in a RAID10 pool on a Sun X4500, the seek
performance of Postgres (and other DBMSs without async or threaded I/O) is
that of a single disk: 120 seeks per second. With more concurrent users,
the seeks do scale on such a machine, up to the number of disks in the
RAID. But for your one-user example, it is no help.

If your problem is (2), you can re-order the data on disk by using a CREATE
TABLE statement like this:
CREATE TABLE fast_table AS SELECT * FROM slow_table ORDER BY teacher_id;
CREATE INDEX teacher_id_ix ON fast_table (teacher_id);
VACUUM ANALYZE fast_table;

You should then see ordered access when you do index scans on teacher_id.
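
If you'd rather keep the original table instead of making a copy, CLUSTER
does the same physical reordering in place (a sketch assuming slow_table's
existing teacher_id index is named teacher_id_ix; adjust to your actual
index name, and note the 8.2-era syntax puts the index name first):

CLUSTER teacher_id_ix ON slow_table;
ANALYZE slow_table;

Be aware that CLUSTER takes an exclusive lock and rewrites the whole table,
so it is best run off-hours.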

- Luke

On 1/12/07 4:31 PM, "Mark Dobbrow" <mdobbrow(at)yahoo(dot)com> wrote:

> Hello -
>
> I have a fairly large table (3 million records), and am fetching 10,000
> non-contiguous records doing a simple select on an indexed column, i.e.
>
> select grades from large_table where teacher_id = X
>
> This is a test database, so the number of records is always 10,000 and I have
> 300 different teacher ids.
>
> The problem is, sometimes fetching uncached records takes 0.5 secs and
> sometimes (more often) it takes more like 10.0 seconds
>
> (fetching the same records for a given teacher_id a second time takes about
> 0.25 secs)
>
> Has anyone seen similar behavior or know what the solution might be?
>
> any help much appreciated,
> Mark
>
>
>
> ps. My shared_buffers is set at 5000 (kernel max), and work_mem=8192
