Re: Simple join doesn't use index

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alex Vinnik <alvinnik(dot)g(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join doesn't use index
Date: 2013-01-09 04:34:11
Message-ID: CAMkU=1zFftuBD5pEd8droNEehg4AS7H1O-1scoqzdwqWSvwN6w@mail.gmail.com
Lists: pgsql-performance

On Thursday, January 3, 2013, Alex Vinnik wrote:

> Hi everybody,
>
> I have implemented my first app using PG DB and thought for a minute(may
> be two) that I know something about PG but below
> problem totally destroyed my confidence :). Please help me to restore it.
>
> Here is simple join query. It runs just fine on MS SQL 2008 and uses
> all available indexes using even bigger overall dataset.
>
> select visits.id, views.id
> from visits join views on visits.id = views.visit_id
> where visits.created_at >= '11/15/2012' and visits.created_at <
> '11/16/2012'
>
> Quick performance stat
>
> MS SQL: 1 second, 264K rows
>

If it returns 264K rows in 1 second, then it must have all the data in
memory, which prompts a few questions:

Is *all* data in memory, or is it just the data needed for this particular
query because you already ran it recently with the same date range?

> PG: 158 seconds, 264K rows
>

Does the PG machine have enough memory to hold all the data in RAM? If so,
does it actually have all the data in RAM? That is, is the cache already
warm? Starting from scratch it can take a long time for the cache to warm
up naturally. And finally, if all the data is in RAM, does PG know this?
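
For the middle question, one way to see what is actually sitting in
shared_buffers (only shared_buffers, not the OS page cache) is the
pg_buffercache contrib module, assuming it is installed:

-- CREATE EXTENSION pg_buffercache;   -- once, as superuser
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
WHERE c.relname IN ('views', 'visits', 'views_visit_id_index')
GROUP BY c.relname
ORDER BY 2 DESC;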

For the last question, the answer is "no", since you are using default
settings. You need to lower random_page_cost and probably also
seq_page_cost in order to trick PG into thinking the data is in RAM. Of
course, if you do this when the data is in fact not in RAM, the result
could be catastrophically bad plans. (And when I tried to replicate your
situation on anemic hardware, the hash join was indeed often faster than
the nested loop over both indexes.)
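
For example, if the data really is fully cached, something along these
lines in a session is a common starting point (the numbers are only
illustrative, not a recommendation for your hardware):

SET seq_page_cost = 0.5;     -- default is 1.0
SET random_page_cost = 0.6;  -- default is 4.0; keeping it close to
                             -- seq_page_cost says random reads are nearly
                             -- as cheap as sequential ones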

>
> Explain plan from both DBs
>
> PG QUERY PLAN
> Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
> Hash Cond: (views.visit_id = visits.id)
> -> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
>

This cost estimate is driven mostly by seq_page_cost and
cpu_tuple_cost, which at their defaults mean the table has about 645,000
blocks (819,136 - 17,434,456/100) and each block has ~30 rows.
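
Spelled out with the default cost parameters (seq_page_cost = 1.0,
cpu_tuple_cost = 0.01), the back-of-the-envelope arithmetic is:

819,136.56 ~= blocks * 1.0 (seq_page_cost) + 17,434,456 * 0.01 (cpu_tuple_cost)
blocks     ~= 819,137 - 174,345 ~= 645,000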

But you are returning 248,494 rows, or roughly one matching row for every
2.5 blocks. Let's say you need to fetch 200,000 blocks (in random order)
to get those rows. Since at default settings (random_page_cost = 4)
fetching 200,000 random blocks is considered as expensive as fetching
800,000 sequential blocks, the index scan you want already looks about as
expensive as the sequential scan. But,
if you want to use the index scan, you also have to fetch the index
blocks, which a sequential scan does not need to do. There are probably
about 50,000 index blocks, but each one has to be fetched about 5 times
(248,494/50,000). Because your effective_cache_size is so low, PG assumes
the next time it needs to fetch the same block, it will no longer be in
memory and so needs to be fetched again at full random_page_cost.
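
Back-of-the-envelope, with random_page_cost at its default of 4.0 (this
is only the shape of the estimate, not the planner's exact formula):

heap blocks:   ~200,000 random fetches * 4.0      ~=   800,000
index blocks:   ~50,000 blocks * ~5 visits * 4.0  ~= 1,000,000
total          ~= 1,800,000, versus the ~1,100,000 the planner put on
                the hash join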

> It is clear that PG does full table scan "Seq Scan on views
> (cost=0.00..819136.56 rows=17434456 width=8)"
>
> Don't understand why PG doesn't use views_visit_id_index in that query but
> rather scans whole table. One explanation I have found that when resulting
> dataset constitutes ~15% of total number of rows in the table then seq scan
> is used.
>

I don't know where you found that rule of thumb, but it would probably be
more accurate if it were given in terms of the percentage of the table's
*blocks* scanned, rather than *rows*.

> In this case resulting dataset is just 1.5% of total number of rows.
>

Since there are about 30 rows per block, scanning 1.5% of the rows means
scanning somewhat less than 45% of the blocks (1.5% of rows times ~30
rows per block), assuming the rows are randomly distributed over the
blocks. And those blocks are read in a less efficient, random order.
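
As a sanity check, if each row matches independently with probability
1.5%, the expected fraction of blocks containing at least one match is

1 - (1 - 0.015)^30 ~= 36%

which is indeed somewhat less than the 45% upper bound.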

> Postgres 9.2
>

You are probably getting hit hard by the overly-large "fudge factor"
penalty for scans of large indexes, which has been much discussed
recently in regard to 9.2.

> Ubuntu 12.04.1 LTS
> shared_buffers = 4GB the rest of the settings are default ones
>

The default effective_cache_size is almost certainly wrong, and if the
analogy to MS SQL is correct, then random_page_cost almost certainly is
as well.
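
For example (the value here is just a placeholder; it should reflect
shared_buffers plus however much of the data the OS file cache can
realistically hold):

SET effective_cache_size = '8GB';  -- placeholder; the 9.2 default is 128MB

The same setting can of course go in postgresql.conf rather than a
per-session SET.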

Another poster referred you to the wiki page with suggestions on how to
report slow queries, particularly using EXPLAIN (analyze, buffers) rather
than merely EXPLAIN. In this case, I would also try setting
enable_hashjoin=off and enable_mergejoin=off in the session, in order to
force the planner to use the plan you think you want, so we can see what
PG thinks of that one.
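
Concretely, something like this in a psql session, using your original
query:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at < '11/16/2012';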

Cheers,

Jeff

>
