Re: Performance With Joins on Large Tables

From: "Joshua Marsh" <icub3d(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org, jim(at)nasby(dot)net
Subject: Re: Performance With Joins on Large Tables
Date: 2006-09-13 14:49:24
Message-ID: 38242de90609130749u976295fwb10904ff4797b920@mail.gmail.com
Lists: pgsql-performance

On 9/13/06, Jim C. Nasby <jim(at)nasby(dot)net> wrote:
> On Tue, Sep 12, 2006 at 04:17:34PM -0600, Joshua Marsh wrote:
> > data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
> > v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
> > FROM s.custdate) FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno =
> > s.dsiacctno;
> > QUERY PLAN
> > -----------------------------------------------------------------------------------------------
> >  Merge Join  (cost=293767607.69..305744319.52 rows=285392608 width=11)
> >    Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
> >    ->  Sort  (cost=127304933.87..127585815.71 rows=112352736 width=20)
> >          Sort Key: v.dsiacctno
> >          ->  Seq Scan on view_505 v  (cost=100000000.00..104604059.36 rows=112352736 width=20)
> >    ->  Sort  (cost=166462673.82..167176155.34 rows=285392608 width=17)
> >          Sort Key: s.dsiacctno
> >          ->  Seq Scan on r3s169 s  (cost=100000000.00..106875334.08 rows=285392608 width=17)
> > (8 rows)
> >
> >
> > Since enable_seqscan is off, my understanding is that in order for the query
> > planner to use a sequential scan it must think there is no other alternative.
> > Both sides are indexed and analyzed, so that confuses me a little.
> >
> > I tried it on a smaller sample set of the data and it works fine:
>
> Actually, enable_seqscan=off just adds a fixed overhead to the seqscan
> cost estimate. That's why the cost for the seqscans in that plan starts
> at 100000000. I've suggested changing that to a variable overhead based
> on the expected rowcount, but the counter-argument was that anyone with
> so much data that the fixed amount wouldn't work would most likely be
> having bigger issues anyway.
>
> Other things you can try to get the index scan back would be to reduce
> random_page_cost and to analyze the join fields in those tables with a
> higher statistics target (though I'm not 100% certain the join cost
> estimator actually takes that into account). Or if you don't mind
> patching your source code, it wouldn't be difficult to make
> enable_seqscan use a bigger 'penalty value' than 100000000.
> --
> Jim Nasby jim(at)nasby(dot)net
> EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
>
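
For reference, the knobs Jim mentions map to commands along these lines. This is
only a sketch; the values and the statistics target shown are illustrative, not
recommendations:

  -- lower the planner's estimate of a random page fetch (session-scoped; value illustrative)
  SET random_page_cost = 2;

  -- gather finer-grained statistics on the join columns, then re-analyze them
  ALTER TABLE view_505 ALTER COLUMN dsiacctno SET STATISTICS 500;
  ALTER TABLE r3s169 ALTER COLUMN dsiacctno SET STATISTICS 500;
  ANALYZE view_505 (dsiacctno);
  ANALYZE r3s169 (dsiacctno);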

Thanks for the tip. I lowered random_page_cost and got these results:

data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
FROM s.custdate) FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno =
s.dsiacctno;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..20921221.49 rows=285230272 width=11)
   Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
   ->  Index Scan using view_505_dsiacctno on view_505 v  (cost=0.00..2838595.79 rows=112393848 width=20)
   ->  Index Scan using r3s169_dsiacctno on r3s169 s  (cost=0.00..7106203.68 rows=285230272 width=17)
(4 rows)

That seems to have done it. Are there any side effects to this
change? I read about random_page_cost in the documentation and it
seems like this is strictly for planning. All the tables on this
database will be indexed and of a size similar to these two, so I
don't see it causing any other problems. Thought I would check, though. :)
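
Since it is purely a planner parameter, it can also be scoped instead of changed
cluster-wide, e.g. (again just a sketch; the value is illustrative):

  ALTER DATABASE data SET random_page_cost = 2;  -- picked up by new sessions on this database only
  RESET random_page_cost;                        -- return the current session to the configured default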
