Performance for seq. scans

From: Jules Bean <jules(at)jellybean(dot)co(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Performance for seq. scans
Date: 2000-07-26 10:51:33
Message-ID: 20000726115132.C29809@grommit.office.vi.net
Lists: pgsql-general

Hi all,

I've had a look over the docs and the FAQ and I can't see anything
answering this, so here goes:

I'm in the (slightly unusual, in a relational world) situation that
the dominant query on my database is a wildcard search, so that no
indexes can be used. (E.g. select * from table_a where foo like
'%bar%').

Without some /very/ clever (and disk-space intensive) subword
indexing, this query is doomed to be a sequential scan, which I'm
resigned to. It's a question of making that as fast as possible.
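To make the disk-space point concrete, here is a minimal sketch of the kind of subword (trigram) indexing I mean -- all names are hypothetical, and a real system would have to persist the index on disk, which is where the space cost comes from:

```python
# Sketch of trigram-based subword indexing (hypothetical helper
# names). Every 3-character substring of every row is indexed, so
# the index can dwarf the data itself.

def trigrams(s):
    """All 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(rows):
    """Map each trigram to the set of row ids containing it."""
    index = {}
    for row_id, text in rows:
        for tri in trigrams(text):
            index.setdefault(tri, set()).add(row_id)
    return index

def candidates(index, pattern):
    """Rows containing every trigram of the pattern. These still need
    a final substring check, but the full sequential scan is avoided."""
    tris = trigrams(pattern)
    if not tris:
        return None  # pattern shorter than 3 chars: index unusable
    return set.intersection(*(index.get(t, set()) for t in tris))

rows = [(1, "foobar"), (2, "barfly"), (3, "bazaar")]
idx = build_index(rows)
print(sorted(candidates(idx, "bar")))  # -> [1, 2]
```

The pay-off is that a '%bar%' search touches only the rows sharing the pattern's trigrams instead of scanning the whole table; the price is the index's size and maintenance cost.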

My dataset is around 500M as a text file on disk, and around 1500M as
postgres data. The machine I'm working on at the moment does the
search in around 90 seconds. (For comparison, MS SQL 7, the other
solution being considered here, takes around 75 seconds on identical
hardware).

Interestingly, 'vmstat' shows that the CPU is maxed out at 50%
(this being a dual CPU machine), while the disk access is a mere
4M/sec --- bonnie claims this machine is capable of around 25M/sec to
this particular disk. So it would seem that the bottleneck is the
CPU. [I understand why both CPUs aren't used]

My previous feeling had been that the bottleneck was going to be the
disk, in which case I was going to recommend installing enough memory
in the machine that the kernel disk cache could cache the whole file,
and thus speed up the search. In the current situation, it seems
like the only improvement would be to install a faster CPU (and since
we're currently using a PIII 600, I couldn't expect much more than a
60% improvement or so that way).

It seems slightly surprising that postgres can only "service" a 4M/sec
stream of data from the disk with a LIKE query -- not such a complex
query. Is there some unnecessary data copying in the critical path for
the search?

I almost forgot -- this is debian package 7.0.2-2.

Any pointers on whether this performance can be improved would be
welcome. Currently I'm feeling like the right solution may be to
dump the 500M text file periodically and run 'grep' on a machine with
enough memory to cache the text file ;-)
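That grep fallback amounts to nothing more than a streaming fixed-string scan over the dumped file; a sketch (the dump format and names are hypothetical), showing that with the file in the page cache this is pure CPU work, much like the LIKE '%bar%' scan:

```python
# Sketch of the "dump and grep" fallback: stream lines of a dumped
# text file and keep the ones containing the pattern, like
# grep -F. Disk plays no part once the file is cached in memory.

def grep_lines(lines, pattern):
    """Yield the lines that contain pattern as a fixed substring."""
    for line in lines:
        if pattern in line:
            yield line

sample = ["alpha bar one", "no match here", "foobar two"]
print(list(grep_lines(sample, "bar")))  # -> ['alpha bar one', 'foobar two']
```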

Jules Bean
