Re: Optimizing queries that use multiple tables and many order by columns

From: Joshua Berry <yoberi(at)gmail(dot)com>
To: "Wappler, Robert" <rwappler(at)ophardt(dot)com>
Cc: PostgreSQL - General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing queries that use multiple tables and many order by columns
Date: 2010-08-25 17:29:06
Message-ID: AANLkTimHPcXQp=y70BRSxxsFdBWF56nOPxMZNcJ0NeCR@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Aug 25, 2010 at 10:40 AM, Wappler, Robert <rwappler(at)ophardt(dot)com>wrote:

> On 2010-08-25, Joshua Berry wrote:
>
> > Here's my latest culprit:
> >
> > select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> > JOB.CompanyCode, Anl.SampleName
> > from analysis anl join job on anl.job = job.job
> > order by job.companycode, anl.job, anl.lab
> > limit 10;
> >
>
> Could you try to remove the limit clause? I have seen several times,
> that it may slow down a query. Although I haven't tested it, that an
> OFFSET 0 clause can improve the situation, iirc.
>

The actual query uses a cursor, which seems to run the query and after the
entire set is ready to be fetched, it will be able to allow fetching. So,
I'm not using a limit in the actual queries, but something like this:

declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
fetch 10 in "SQL_CUR0453D910";
close "SQL_CUR0453D910";

From an algebraic point of view, I cannot see obvious inefficiencies.
> Others, which now the internals of pg better, might see more.
>

> That are estimated query plans, what does EXPLAIN ANALYZE say? The query
> plans above do not execute the query instead they just make a rough
> guess about the costs. Reality might be different. Also you may want to
> run VACUUM ANALYZE before.
>
>
The database is vacuum analyze'd and the stat target is the default of 100.
I'm also using PG 8.4.4 running on Centos 5.5 x86_64

--Here's what explain analyze says for the query
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;

Sort (cost=38047.92..38495.65 rows=179095 width=32) (actual
time=1890.796..2271.248 rows=178979 loops=1)
Sort Key: job.companycode, anl.job, anl.lab
Sort Method: external merge Disk: 8416kB
-> Hash Join (cost=451.20..18134.05 rows=179095 width=32) (actual
time=8.239..260.848 rows=178979 loops=1)
Hash Cond: (anl.job = job.job)
-> Seq Scan on analysis anl (cost=0.00..14100.95 rows=179095
width=23) (actual time=0.026..91.602 rows=178979 loops=1)
-> Hash (cost=287.20..287.20 rows=13120 width=17) (actual
time=8.197..8.197 rows=13120 loops=1)
-> Seq Scan on job (cost=0.00..287.20 rows=13120 width=17)
(actual time=0.007..4.166 rows=13120 loops=1)
Total runtime: 2286.224 ms

> Maybe, the planner decides for a Sort Join, if there are sorted indexes
> for anl.job and job.job. But the speed-up may vary depending on the
> data.
>

It seems to be reading the entire dataset, then sorting, right? There's not
much more that could be done to improve such queries, aside from increasing
memory and IO bandwidth.

But now that I've said that, there's the following query that deals with
exactly the same set of data, but the ordering involves only one of the two
joined tables.

explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab; --Only order by indexed
columns from job.

Nested Loop (cost=0.00..65305.66 rows=179095 width=32) (actual
time=0.084..288.976 rows=178979 loops=1)
-> Index Scan using job_companycode on job (cost=0.00..972.67 rows=13120
width=17) (actual time=0.045..7.328 rows=13120 loops=1)
-> Index Scan using analysis_job_lab on analysis anl (cost=0.00..4.63
rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
Index Cond: (anl.job = job.job)
Total runtime: 303.230 ms

If I order by columns from the other table, analysis only, I get the follow
query and results:
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by --job.companycode,
anl.job, anl.lab; --Only order by indexed columns from analysis.

Merge Join (cost=0.56..44872.45 rows=179095 width=32) (actual
time=0.078..368.620 rows=178979 loops=1)
Merge Cond: (anl.job = job.job)
-> Index Scan using analysis_job_lab on analysis anl
(cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460
rows=178979 loops=1)
-> Index Scan using job_job_pk on job (cost=0.00..508.53 rows=13120
width=17) (actual time=0.039..53.733 rows=179005 loops=1)
Total runtime: 388.884 ms

Notice that in these cases the query completes in <400 ms and the other
query that involves ordering on columns from both of the joined tables
completes in >2300ms.

In the application here, these queries are used by a client application to
fill a window's listbox that can be scrolled up or down. If the user changes
direction of the scroll, it initiates a new cursor and query to fetch a page
of results. If the scrolling motion is in the same direction, it simply
continues to fetch more results from the cursor. But each time the direction
of movement changes, there can be a significant lag.

Any suggestions would be helpful! I'll assume for now that the indexes and
queries can't be improved, but rather that I should tweak more of the
postmaster settings. Please correct me if you know better and have time to
reply.

Thanks,
-Joshua

P.S. Is it possible to have indexes that involves several columns from
different but related tables? If so, where can I learn about them?

> > Thanks for any insights and tips you can provide!
> >
> > Kind Regards,
> > -Joshua Berry
> >
> >
> >
>
> HTH
>
> --
> Robert...
>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2010-08-25 18:03:15 Re: [GENERAL] initdb fails to allocate shared memory
Previous Message Steve Clark 2010-08-25 17:06:46 Re: Feature proposal