Re: How to improve speed of 3 table join & group (HUGE tables)

From: John Major <major(at)cbio(dot)mskcc(dot)org>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve speed of 3 table join & group (HUGE tables)
Date: 2007-10-18 19:46:19
Message-ID: 4717B80B.6050706@cbio.mskcc.org
Lists: pgsql-performance

Hello Nis-

I did reset the defaults before running the explain.

Primary keys for the tables:
sequence_fragment.seq_frag_id
sequence.sequence_id

Candidate keys:
fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).

None of the fields are nullable.

sequence_fragment is the anchor table.
seq_frag_id is the primary key (and foreign key to
fragment_external_info); ~4.5 million unique entries.
sequence_id is an indexed field; ~3 million distinct IDs.

sequence_alignment has 500 million entries, but I join on sequence_id,
which has ~3 million distinct values.
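
For reference, here is a rough sketch of the relevant schema as I've
described it above. The column types and index names below are just
placeholders of mine (not the real DDL); only the keys and indexes
mentioned in this thread are shown.

create table sequence_fragment (
    seq_frag_id bigint primary key,  -- ~4.5 million unique entries
    sequence_id bigint not null,     -- ~3 million distinct IDs
    library_id  integer not null     -- not indexed
);
create index sequence_fragment_sequence_id_idx
    on sequence_fragment (sequence_id);

create table fragment_external_info (  -- ~10 million rows
    seq_frag_id bigint not null
        references sequence_fragment (seq_frag_id),
    clip_type   text   not null      -- not indexed
);
create index fragment_external_info_seq_frag_id_idx
    on fragment_external_info (seq_frag_id);

create table sequence_alignment (    -- ~500 million rows
    sequence_id bigint not null      -- joins to sequence_fragment.sequence_id
);
create index sequence_alignment_sequence_id_idx
    on sequence_alignment (sequence_id);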

When I make the suggested changes, the new query is:
select sf.library_id, fio.clip_type, count(sf.sequence_id)
from sequence_fragment sf, fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)
group by sf.library_id, fio.clip_type

After making the two changes, the estimated cost dropped dramatically,
but it is still very high.
Original EXPLAIN cost:
cost=1308049564..1345206293 rows=54 width=16

New EXPLAIN cost:
cost=11831119..11831120 rows=54 width=16
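
The costs above are only the planner's estimates. To compare actual run
times (per the list's standing advice that EXPLAIN ANALYZE is your
friend), I can run the same rewritten query under EXPLAIN ANALYZE; note
that this executes the query for real, so it takes as long as the query
itself:

explain analyze
select sf.library_id, fio.clip_type, count(sf.sequence_id)
from sequence_fragment sf, fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id IN
 (SELECT sequence_id from sequence_alignment)
group by sf.library_id, fio.clip_type;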

John

Nis Jørgensen wrote:
> John Major wrote:
>
>> I am trying to join three quite large tables, and the query is
>> unbearably slow (meaning I can't get results in more than a day of
>> processing).
>> I've tried the basic optimizations I understand, and nothing has
>> improved the execution speed... any help with this would be greatly
>> appreciated.
>>
>>
>> The three tables are quite large:
>> sequence_fragment = 4.5 million rows
>> fragment_external_info = 10 million rows
>> sequence_alignment = 500 million rows
>>
>>
>> The query I am attempting to run is this:
>>
>> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
>> from sequence_alignment sa, sequence_fragment sf,
>> fragment_external_info fio
>> where sf.seq_frag_id = fio.sequence_frag_id
>> and sf.sequence_id = sa.sequence_id
>> group by sf.library_id, fio.clip_type
>>
>>
>> NOTES:
>> ~ There are indexes on all of the fields being joined (but not on
>> library_id or clip_type).
>> ~ Everything has been re-analyzed post index creation.
>>
>
> What are the primary (and candidate) keys of the tables? Are any of the
> fields nullable? How many distinct values exist for
> sequence_alignment.sequence_id?
>
>
>> ~ I've tried "set enable_seqscan=off" and set (join_table_order or
>> something) = 1
>>
>
>
> It would help if you turned the settings back to defaults before doing
> the ANALYZE - or provided the results of that case as well.
>
>
>> The explain plan is as follows:
>>
>
> [cut]
>
> Without trying to understand the ANALYZE output, I would suggest two
> possible optimizations:
>
> - Using count(distinct(sf.sequence_id)) instead of
> count(distinct(sa.sequence_id)).
>
> - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
> (SELECT sequence_id from sequence_alignment)".
>
> The first one probably won't help (nor hurt), but the second one might
> be able to get rid of the table scan, or at least the need to do the full
> merge join (which returns an estimated 3 billion rows).
>
> Hope this helps,
>
> Nis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>
