Re: not using indexes on large table

From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Jeroen Kleijer <jeroen(dot)kleijer(at)xs4all(dot)nl>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: not using indexes on large table
Date: 2007-04-26 19:17:49
Message-ID: 1177615069.15085.65.camel@state.g2switchworks.com
Lists: pgsql-performance

On Sat, 2007-04-21 at 15:17, Jeroen Kleijer wrote:
> Hi all,
>
> I'm a bit new to PostgreSQL and database design in general so forgive me
> for asking stupid questions. ;-)
>
> I've set up a PostgreSQL database on a Linux machine (2 processors, 1GB
> mem) and while the database itself resides on a NetApp filer, via NFS,
> this doesn't seem to impact the performance too drastically.

What does a benchmark like bonnie++ say about your performance? And I
hope your data's not too important to you, because I've had LOTS of
problems with NFS mounts in the past with pgsql. Generally speaking,
NFS can be moderately fast or moderately reliable (for databases), but
it's rarely both at the same time.

Considering the cost of a quartet of 80 Gig SATA drives ($59x4) and a
decent RAID controller (LSI, Areca at ~$450 or so), you could be getting
VERY good performance out of your system with real reliability at the
same time on a RAID-10 volume. Then use the NetApp for backup. That's
what I'd do.

> I basically use it for indexed tables without any relation between 'em;
> so far this has worked perfectly.
>
> For statistics I've created the following table:
> volume varchar(30),
> qtree varchar(255),
> file varchar(512),
> ctime timestamp,
> mtime timestamp,
> atime timestamp
> and created separate indexes on the volume and qtree columns.

You might want to look at setting this up as two or three tables, with a
view and update triggers so it still looks like one table to the user,
and with qtree and file in their own tables. That would make your main
stats table only one varchar(30) and three timestamps wide. This matters
especially if qtree and file tend to be large. If one of those tends to
be small and the other large, then look at moving just the large one
into its own table. The reasons for this will be obvious later on in
this post.
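A rough sketch of what I mean (table and column names are just
illustrative here, not from your schema):

create table qtrees (
    qtree_id serial primary key,
    qtree    varchar(255) not null unique
);

create table files (
    file_id serial primary key,
    file    varchar(512) not null unique
);

-- the main stats table stays narrow: one varchar(30) plus ints and timestamps
create table stats (
    volume   varchar(30),
    qtree_id integer references qtrees,
    file_id  integer references files,
    ctime    timestamp,
    mtime    timestamp,
    atime    timestamp
);

-- present it back to the application as one wide "table"
create view stats_wide as
    select s.volume, q.qtree, f.file, s.ctime, s.mtime, s.atime
      from stats s
      join qtrees q using (qtree_id)
      join files  f using (file_id);

Put rules or triggers on the view for inserts and updates and the
application never has to know the difference.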

> The problem comes when I try to do a query without using a where clause
> because by then, it completely discards the indexes and does a complete
> table scan which takes over half an hour! (40.710.725 rows, 1110258
> pages, 1715 seconds)

Yes it does, and it should.

Why? Visibility. This has been discussed quite a bit on the lists.
Because of the particular design of PostgreSQL's MVCC implementation,
indexes cannot carry row visibility information. Therefore,
every time the db looks in an index, it then has to look in the table
anyway to find the right version of that tuple and to see if it's
actually valid for your snapshot.
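
You can see this with explain (I'm using "stats" here as a stand-in for
whatever you actually named your statistics table):

explain select count(*) from stats;
-- expect an Aggregate over a Seq Scan: with no where clause the index
-- can't help, since every row version has to be checked in the heap anyway

explain select count(*) from stats where volume = 'vol0';
-- here the planner can use the index on volume, but it still visits the
-- heap for each matching tuple to check visibility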

> Can anyone tell me if this is normal behaviour (half an hour seems over
> the top to me) and if not, what I can do about it.

Yes, this is normal behaviour. It's just how PostgreSQL works. There
are some workarounds out there that involve updating extra tables that
carry things like counts etc... Each of these costs something in
overhead.
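
One common version of that workaround, sketched here with made-up names,
is a one-row counts table kept current by triggers (every insert and
delete on the big table then pays for an extra update, and concurrent
writers contend on that one row):

create table stats_rowcount (n bigint not null);
insert into stats_rowcount values (0);

create or replace function stats_count_trig() returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        update stats_rowcount set n = n + 1;
        return new;
    else  -- DELETE
        update stats_rowcount set n = n - 1;
        return old;
    end if;
end;
$$ language plpgsql;

create trigger stats_count
    after insert or delete on stats
    for each row execute procedure stats_count_trig();

After that, "select n from stats_rowcount" gives you the row count
instantly instead of after a half hour seq scan.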

There are two distinct problems here. One is that you're trying to use
PostgreSQL in a role where perhaps a different database might be a
better choice. MSSQL Server or DB2 or even MySQL might be a better
choice depending on what you want to do with your data.

The other problem is that you're using an NFS server. Either go whole
hog and buy a SAN with dual 2G nics in it, or put local storage
underneath your machine with LOTS of hard drives in RAID-10.

Note that while other databases may be better at some of the queries
you're trying to run, it might be that PostgreSQL is still a good choice
because of other queries, and you can do rollups of the data that it's
slow at while using it for the things it is good at.
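
By rollups I mean something like this (again, the names are made up): run
the expensive aggregate once from cron, store the result in a small
summary table, and point your reports at that instead of the 40 million
row table:

create table stats_by_volume_day (
    volume varchar(30),
    day    date,
    nfiles bigint,
    primary key (volume, day)
);

-- run nightly; aggregates only yesterday's rows instead of the whole table
insert into stats_by_volume_day
    select volume, date_trunc('day', mtime)::date as day, count(*)
      from stats
     where mtime >= current_date - 1
       and mtime <  current_date
     group by volume, date_trunc('day', mtime)::date;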

I've got a test db on my workstation that's pretty big at 42,463,248
rows and taking up 12 Gigs just for the table, 7.7 Gigs in indexes, and
a select count(*) on it takes 489 seconds. I try not to do things like
that. It covers the last 9 months of statistics.

This query:

select a, b, count(*)
from summary
where atime > '2006-06-16'
  and perspective = 'yada'
group by a, b
order by a, b

took 300 seconds, which is typical.

This is on a workstation with one CPU, 2 gigs of ram, and a 150 Gig SATA
drive. It's running X Windows, with Evolution, Firefox, and a dozen
other user apps open. Our "real" server, with 4 disks in RAID-5 on a
mediocre RAID controller but with 2 CPUs and 6 gigs of ram, stomps my
little workstation into the ground.

I have the feeling my laptop with 512 Meg of ram and a 1.6 GHz CPU would
be faster than your current server.
