Slow performance with join on many fields

From: Alex Johnson <alex(at)aretesystems(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Slow performance with join on many fields
Date: 2003-03-04 05:48:49
Message-ID: F8F237F6-4E04-11D7-85B6-000502FCE08D@aretesystems.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello-

I'm working with a MS Access database that a client wants to turn into
a multi-user database. I'm evaluating PostgreSQL for that purpose (I'd
_really_ like to be able to recommend and open-source solution to
them). However, I'm running into a performance-related issue that I was
hoping this list could help me with.

I have three tables: tbl_samples (~2000 rows), tbl_tests (~4000 rows),
and tbl_results (~20,000 rows), with one-to-many relationships between
them (cascading, in the order given - table definitions are attached
below). I'm looking at the following query that joins these three
tables:

SELECT
tbl_samples.station_id,
tbl_samples.samp_date,
tbl_samples.matrix,
tbl_samples.samp_type_code,
tbl_samples.samp_no,
tbl_samples.field_samp_id,
tbl_tests.method,
tbl_tests.lab,
tbl_results.par_code,
tbl_results.val_qualifier,
tbl_results.value,
tbl_results.units,
tbl_results.mdl,
tbl_results.date_anal
FROM
(tbl_samples
INNER JOIN tbl_tests USING
(station_id,
samp_date,
matrix,
samp_type_code,
samp_no,
samp_bdepth,
samp_edepth)
)
INNER JOIN tbl_results USING
(station_id,
samp_date,
matrix,
samp_type_code,
samp_no,
samp_bdepth,
samp_edepth,
method);

In Access, this query runs in about a second. In PostgreSQL on the same
machine, it takes about 12-15 seconds for the initial rows to be
returned, and about 45 seconds to returns all rows. (This is consistent
whether I use psql, use the pgAdminII SQL window, or use Access with
the ODBC driver.)

This is the output from EXPLAIN:

Nested Loop (cost=437.73..1216.02 rows=1 width=245)
Join Filter: ("outer".method = "inner".method)
-> Merge Join (cost=437.73..461.38 rows=125 width=131)
Merge Cond: (("outer".matrix = "inner".matrix) AND
("outer".samp_edepth = "inner".samp_edepth) AND ("outer".samp_bdepth =
"inner".samp_bdepth) AND ("outer".samp_no = "inner".samp_no) AND
("outer".samp_type_code = "inner".samp_type_code) AND
("outer".samp_date = "inner".samp_date) AND ("outer".station_id =
"inner".station_id))
-> Sort (cost=117.51..120.77 rows=1304 width=63)
Sort Key: tbl_samples.matrix, tbl_samples.samp_edepth,
tbl_samples.samp_bdepth, tbl_samples.samp_no,
tbl_samples.samp_type_code, tbl_samples.samp_date,
tbl_samples.station_id
-> Seq Scan on tbl_samples (cost=0.00..50.04 rows=1304
width=63)
-> Sort (cost=320.22..328.68 rows=3384 width=68)
Sort Key: tbl_tests.matrix, tbl_tests.samp_edepth,
tbl_tests.samp_bdepth, tbl_tests.samp_no, tbl_tests.samp_type_code,
tbl_tests.samp_date, tbl_tests.station_id
-> Seq Scan on tbl_tests (cost=0.00..121.84 rows=3384
width=68)
-> Index Scan using tbl_results_pkey on tbl_results
(cost=0.00..5.99 rows=1 width=114)
Index Cond: (("outer".station_id = tbl_results.station_id) AND
("outer".samp_date = tbl_results.samp_date) AND ("outer".matrix =
tbl_results.matrix) AND ("outer".samp_type_code =
tbl_results.samp_type_code) AND ("outer".samp_no = tbl_results.samp_no)
AND ("outer".samp_bdepth = tbl_results.samp_bdepth) AND
("outer".samp_edepth = tbl_results.samp_edepth))

I've done the following to try to improve performance:

-postgresql.conf:
increased shared_buffers to 384
increased sort_mem to 2048
-clustered all tables on the pkey index
-made sure the joined fields are indexed (they are through the pkeys)

As a note, vm_stat shows no paging while the query is run. Also, I
realize that these keys are large (as in the number of fields). I'll be
condensing these down to sequential IDs (e.g. a SERIAL type) for a
further test, but I'm curious why Access seems to outperform Postgres
in this instance.

My question is, am I missing anything? PostgreSQL will be a hard sell
if they have to take a performance hit.

Thanks for any suggestions you can provide. Sorry for the long e-mail,
but I wanted to provide enough info to diagnose the issue.

Alex Johnson
________________________________
Table defs:

CREATE TABLE tbl_Samples (
Station_ID VARCHAR (25) NOT NULL,
Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
Matrix VARCHAR (10) NOT NULL,
Samp_Type_Code VARCHAR (5) NOT NULL,
Samp_No INTEGER NOT NULL,
Samp_BDepth DOUBLE PRECISION NOT NULL,
Samp_EDepth DOUBLE PRECISION NOT NULL,
Depth_units VARCHAR (3),
Samp_start_time TIME,
Samp_end_time TIME,
Field_Samp_ID VARCHAR (20),
Lab_Samp_ID VARCHAR (20),
Samp_Meth VARCHAR (20),
...snip...
PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth)
);

CREATE TABLE tbl_Tests (
Station_ID VARCHAR (25) NOT NULL,
Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
Matrix VARCHAR (10) NOT NULL,
Samp_Type_Code VARCHAR (5) NOT NULL,
Samp_No INTEGER NOT NULL,
Samp_BDepth DOUBLE PRECISION NOT NULL,
Samp_EDepth DOUBLE PRECISION NOT NULL,
Method VARCHAR (50) NOT NULL,
Lab VARCHAR (10) NOT NULL,
Date_Rec TIMESTAMP WITH TIME ZONE,
...snip...
PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method)
);

CREATE TABLE tbl_Results (
Station_ID VARCHAR (25) NOT NULL,
Samp_Date TIMESTAMP WITH TIME ZONE NOT NULL,
Matrix VARCHAR (10) NOT NULL,
Samp_Type_Code VARCHAR (5) NOT NULL,
Samp_No INTEGER NOT NULL,
Samp_BDepth DOUBLE PRECISION NOT NULL,
Samp_EDepth DOUBLE PRECISION NOT NULL,
Method VARCHAR (50) NOT NULL,
Par_code VARCHAR (50) NOT NULL,
Val_Qualifier VARCHAR (50) NOT NULL,
Value DECIMAL (20,9) NOT NULL,
...snip...
PRIMARY KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method,Par_code)
);

ALTER TABLE tbl_Tests ADD CONSTRAINT REL_1 FOREIGN KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth)
REFERENCES tbl_Samples ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE tbl_Results ADD CONSTRAINT REL_2 FOREIGN KEY
(Station_ID,Samp_Date,Matrix,Samp_Type_Code,Samp_No,Samp_BDepth,Samp_EDe
pth,Method)
REFERENCES tbl_Tests ON DELETE CASCADE ON UPDATE CASCADE;

________________________________________________________________________
______
A r e t e S y s t e m s
Alexander M. Johnson, P.E.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2003-03-04 06:27:42 Re: Slow performance with join on many fields
Previous Message Orito 2003-03-01 23:36:46 Re: ABOUT YOUR CREDIT........... eyr