Tuning tsearch searching and indexing

From: Eric Walstad <eric(at)ericwalstad(dot)com>
To: sfpug(at)postgresql(dot)org
Subject: Tuning tsearch searching and indexing
Date: 2009-02-10 18:00:13
Message-ID: c436fcd20902101000w66d98971ub0b49c26852fdad6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: sfpug

Hi All,

Our 'big' table is now at ~15.7M records, it holds address data with a
few other fields, too. We were using Xapian to index the address
data. Our users need to search the address data frequently. We want
to try using tsearch over Xapian for various reasons but we're having
difficulties getting the indexes built. For example, after running
for about 30hours, the indexing operation failed when a scheduled
vacuum started:

er-ca=# UPDATE er_service_address
er-ca-# SET search_tsv=to_tsvector(
er-ca(# 'default',
er-ca(# COALESCE(service_account, '') ||' '||
er-ca(# COALESCE(customer_name, '') ||' '||
er-ca(# COALESCE(addr1, '') ||' '||
er-ca(# COALESCE(city, '') ||' '||
er-ca(# COALESCE(zipcode, '') ||' '||
er-ca(# COALESCE(service_type, '') ||' '||
er-ca(# COALESCE(account_type, '')
er-ca(# )
er-ca-# ;

WARNING: server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

I'm looking for advice on using tsearch on a dataset of this size,
specifically how to speed up the generation of the indexes to units of
whole hours rather than whole days.

Thanks in advance for your advice. Details of the target system are
below; let me know if there's other settings I should post.

Eric.

System info:

$ uname -rvpio
2.6.9-42.0.2.EL #1 Thu Aug 17 17:37:40 EDT 2006 x86_64 x86_64 GNU/Linux

$ cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 4)

$ free
total used free shared buffers cached
Mem: 4059812 3631484 428328 0 9128 3450236
-/+ buffers/cache: 172120 3887692
Swap: 1052248 17664 1034584

# select name, setting from pg_settings where name in ([list of fields
from http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server]);
name | setting
---------------------------+-----------
autovacuum | on
checkpoint_segments | 64
default_statistics_target | 100
effective_cache_size | 256000
max_connections | 100
max_fsm_pages | 840000
max_fsm_relations | 50000
random_page_cost | 4
server_version | 8.1.3
shared_buffers | 125000
wal_buffers | 100
wal_sync_method | fdatasync
work_mem | 20480
(13 rows)

# select pg_size_pretty(pg_database_size('er-ca'));
pg_size_pretty
----------------
17 GB
(1 row)

# \d er_service_address
Table "public.er_service_address"
Column | Type |
Modifiers
-----------------+--------------------------+-----------------------------------------------------------------
modified_by_id | integer | not null
created_by_id | integer | not null
modified_on | timestamp with time zone | not null
created_on | timestamp with time zone |
service_account | character varying(255) | not null
addr1 | character varying(255) | not null
addr2 | character varying(255) |
city | character varying(255) |
state | character varying(2) | not null
zipcode | character varying(5) |
zip4 | character varying(4) |
address_hash | character varying(128) |
fuzzy_hash | character varying(128) |
ut_id | integer | not null
meter_number | character varying(80) |
customer_name | character varying(255) |
service_type | character varying(1) |
account_type | character varying(1) |
status | character varying(2) |
id | integer | not null default
nextval('er_service_address_id_seq'::regclass)
search_tsv | tsvector |
Indexes:
"er_service_address_pkey" PRIMARY KEY, btree (id)
"er_service_address_account_type" btree (account_type)
"er_service_address_addr1" btree (addr1)
"er_service_address_address_hash" btree (address_hash)
"er_service_address_city" btree (city)
"er_service_address_created_by_id" btree (created_by_id)
"er_service_address_customer_name" btree (customer_name)
"er_service_address_service_account" btree (service_account)
"er_service_address_service_type" btree (service_type)
"er_service_address_status" btree (status)
"er_service_address_ut_id" btree (ut_id)
"er_service_address_zip4" btree (zip4)
"er_service_address_zipcode" btree (zipcode)
Foreign-key constraints:
"$1" FOREIGN KEY (modified_by_id) REFERENCES auth_user(id)
"$2" FOREIGN KEY (created_by_id) REFERENCES auth_user(id)
"$3" FOREIGN KEY (ut_id) REFERENCES er_utility(id)

Responses

Browse sfpug by date

  From Date Subject
Next Message Steve Crawford 2009-02-10 19:00:43 Anyone need a ride?
Previous Message Josh Berkus 2009-02-10 00:00:48 Re: pgDay San Francisco