Re: Optimizing queries that use multiple tables and many order by columns

From: "Wappler, Robert" <rwappler(at)ophardt(dot)com>
To: "Joshua Berry" <yoberi(at)gmail(dot)com>, "PostgreSQL - General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing queries that use multiple tables and many order by columns
Date: 2010-08-25 15:40:18
Message-ID: C8E2DAF0E663A948840B04023E0DE32A02B1F4CC@w2k3server02.de.ophardt.com
Lists: pgsql-general

On 2010-08-25, Joshua Berry wrote:

> Hi Group,
>
> I've never really learned how to optimize queries that join
> several tables and have order by clauses that specify columns
> from each table. Is there documentation that could help me
> optimize and have the proper indexes in place? I've read
> through the PG Docs Chapter 11 on Indexes yet still lack the
> needed understanding.
>
> Here's my latest culprit:
>
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab
> limit 10;
>

Could you try removing the LIMIT clause? I have seen several times that it can slow down a query. IIRC, an OFFSET 0 clause can also improve the situation, although I haven't tested that.
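A sketch of the OFFSET 0 trick, untested against your schema -- the usual form wraps the join in a subquery, since PostgreSQL does not flatten a subquery that carries OFFSET:

```sql
-- Untested sketch: OFFSET 0 keeps the subquery from being flattened,
-- which sometimes changes the plan the optimizer picks.
SELECT *
FROM (SELECT anl.priority, anl.lab, anl.job, job.datein,
             job.companycode, anl.samplename
      FROM analysis anl JOIN job ON anl.job = job.job
      OFFSET 0) sub
ORDER BY companycode, job, lab
LIMIT 10;
```

Whether this helps depends on what plan the optimizer was choosing in the first place.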

From an algebraic point of view, I cannot see any obvious inefficiencies. Others, who know the internals of pg better, might see more.

> Here's the query plan using PG 8.4.4:
> Limit  (cost=21990.24..21990.27 rows=10 width=32)
>   ->  Sort  (cost=21990.24..22437.69 rows=178979 width=32)
>         Sort Key: job.companycode, anl.lab
>         ->  Hash Join  (cost=451.20..18122.57 rows=178979 width=32)
>               Hash Cond: (anl.job = job.job)
>               ->  Seq Scan on analysis anl  (cost=0.00..14091.79 rows=178979 width=23)
>               ->  Hash  (cost=287.20..287.20 rows=13120 width=17)
>                     ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17)
>
>
> If I change the above query to only order by one of the
> tables, I get better results:
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode --, anl.job, anl.lab
> limit 10;
> Limit  (cost=0.00..3.65 rows=10 width=32)
>   ->  Nested Loop  (cost=0.00..65269.13 rows=178979 width=32)
>         ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17)
>         ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23)
>               Index Cond: (anl.job = job.job)
>

Those are estimated query plans; what does EXPLAIN ANALYZE say? The plans above do not execute the query, they just make a rough guess about the costs, and reality may differ. You may also want to run VACUUM ANALYZE first.
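Something like this, using your first query -- note that EXPLAIN ANALYZE actually runs the query, so it takes as long as the real thing:

```sql
-- Refresh planner statistics first, then get real row counts and timings.
VACUUM ANALYZE job;
VACUUM ANALYZE analysis;

EXPLAIN ANALYZE
SELECT anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
FROM analysis anl JOIN job ON anl.job = job.job
ORDER BY job.companycode, anl.job, anl.lab
LIMIT 10;
```

Comparing the estimated row counts against the actual ones in that output is usually the quickest way to spot where the planner goes wrong.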

> Any idea on how I can improve this? In the past I would tend
> to create a cached copy of the query as a table that would be
> utilized, but I suspect that there is a better way to go
> about this. I'm using a system (Clarion) which heavily uses
> cursors via the ODBC driver (I use the psqlODBC latest
> version) to get a handful of records at a time, so no actual
> LIMITs would be used in the production queries; I've added
> the LIMITs here to try to simulate the performance
> differences that I find when browsing the data while ordering
> by the above columns.
>
>
> Here are the relevant tables and indexes:
>
>
> CREATE TABLE job
> (
> job bigint NOT NULL, -- Job #
> companycode character(4), -- Company Code
> recdbycode character(3), -- Initials of who checked in sample(s)
> datein date, -- Date sample was received
> project character varying, -- Project or Site name
> remarks text, -- Remarks
> --[CONSTRAINTs etc]
> )
>
> CREATE INDEX job_companycode
> ON job
> USING btree
> (companycode);
> CREATE INDEX job_companycode_job
> ON samples.job
> USING btree
> (companycode, job);
>

Index job_companycode is not used in either plan. It is also redundant: job_companycode_job has companycode as its leading column, so it can serve the same lookups.
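If it really is unused (worth double-checking against pg_stat_user_indexes before dropping anything), removing it would save maintenance overhead on writes:

```sql
-- Hedged suggestion: job_companycode_job covers lookups on companycode
-- alone, because companycode is its leading column, so the single-column
-- index adds nothing. Verify it is unused before dropping.
DROP INDEX job_companycode;
```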

> CREATE TABLE analysis
> (
> lab bigint NOT NULL, -- Lab number
> job bigint, -- Job number
> sampletype character varying(5), -- General class of sample
> priority character(1), -- Priority level
> samplename character varying, -- Sample or Well name
> CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
> REFERENCES job (job) MATCH SIMPLE
> ON UPDATE CASCADE ON DELETE RESTRICT,
> --[CONSTRAINTs etc]
> )
>
> CREATE INDEX analysis_companycode_job_lab
> ON analysis
> USING btree
> (companycode, job, lab);
>
> CREATE INDEX analysis_job_lab
> ON analysis
> USING btree
> (job, lab);
>

Maybe the planner would choose a merge join if there were sorted indexes on anl.job and job.job. But the speed-up may vary depending on the data.
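One more untested idea: after the join, anl.job and job.job are equal, so the ORDER BY can name the job-side column instead. That way the existing job_companycode_job index supplies the first two sort keys and analysis_job_lab the third, which may let the planner avoid the full sort:

```sql
-- Untested rewrite: semantically the same ordering, since
-- anl.job = job.job holds for every joined row, but expressed so the
-- existing indexes (companycode, job) and (job, lab) match the sort keys.
SELECT anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
FROM analysis anl JOIN job ON anl.job = job.job
ORDER BY job.companycode, job.job, anl.lab
LIMIT 10;
```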

> Thanks for any insights and tips you can provide!
>
> Kind Regards,
> -Joshua Berry
>
>
>

HTH

--
Robert...
