Re: usage of indexes for inner joins

From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: bayesianlogic(at)acm(dot)org
Cc: Postgres-General <pgsql-general(at)postgresql(dot)org>
Subject: Re: usage of indexes for inner joins
Date: 2007-10-01 18:13:23
Message-ID: dcc563d10710011113x6b62a4cavbd099859161af2a1@mail.gmail.com
Lists: pgsql-general

On 10/1/07, Jan Theodore Galkowski <bayesianlogic(at)acm(dot)org> wrote:
> Scott,
>
> i didn't think this belonged in the general list, but the example i gave
> for discussion was a toy, for illustration. i could not very well post
> the actual example for many reasons, including proprietary ones and,
> given this is how things work, because the 1.5 million row table in
> question is its own smallest description.

This is the exact kind of question that belongs on -general. But it
does get asked a lot, especially by people coming from other
databases.

> while indexes are being used on that table, there's a companion table
> which is much smaller -- a "mere" 75000 rows -- which is suffering a
> sequential scan, and i was trying to eliminate those.

Well, don't think of sequential scans as plain bad. Sometimes they're
the best choice, sometimes they're not.

Now, if an index scan is provably and noticeably faster than the
sequential scan, then the planner is making the wrong decision. Have
you tried running your query with

set enable_seqscan=off;

to see how it behaves? I've found many somewhat slow queries got
really fast or really slow when I did that.

Note that you shouldn't blindly run queries with that setting on all the
time, as there are many instances where a seqscan is the right answer.
Also, your explain cost estimates will all be way off.

> perhaps it is true that ANALYZE isn't being done often enough. perhaps
> VACUUMs aren't being done often enough either. we're leery of
> scheduling repeated VACUUMs having encountered a case where the VACUUM
> took over an hour to complete.

Run "analyze verbose" on your db and see what it says about number of
page slots needed versus used. that will help you tell if you're
vacuuming enough.

How long vacuum takes isn't really that important. What is important
is how much of an impact it's having on the system. There are several
vacuum parameters in postgresql.conf that can lower vacuum's I/O
impact on your system while increasing its run time.
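The relevant knobs are the cost-based vacuum delay settings; something
like this in postgresql.conf (the values below are just illustrative
starting points, not recommendations) trades a longer-running vacuum
for less I/O pressure:

# cost-based vacuum delay -- spreads vacuum's I/O out over time
vacuum_cost_delay = 10    # ms to sleep once the cost limit is hit (0 = off, the default)
vacuum_cost_limit = 200   # accumulated page cost that triggers the sleep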

Vacuum full is another story. Think of it as a recovery tool, not a
periodic maintenance tool.

> it may, too, be because the tables use user-defined types heavily and
> the original UPDATE involved a SELECT ... IN ... having a GROUP BY with
> a few references to columns deep within user-defined types.

Hard to say without the query and its explain analyze output. It's
common for user-defined functions to produce planner estimates that
are way off; user-defined types, not so much. But the more complex the
query, the more likely it is that the planner will make a bad estimate
of the number of rows somewhere and choose a bad plan.

> that
> wouldn't have been my choice, but, then, they were given to me to work,
> not my design. in fact, PG is the first relational database
> implementation i've used that offered such things in a big way.

Extensibility is quite a useful tool.

> i also don't understand some other things, which are surprising, like
> why some UPDATEs take so much longer when wrapped in a BEGIN
> TRANSACTION-COMMIT than when having the transaction at a statement level.

That is strange. I'd expect that maybe you've got the transaction
waiting on other transactions, so that it's not so much running hard
as just tapping its toe, waiting for another transaction to commit or
roll back.
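A quick way to check for that while the slow UPDATE is running (just a
sketch using the pg_locks system view) is to look for lock requests
that haven't been granted:

select locktype, relation::regclass as relation, mode, pid, granted
  from pg_locks
 where not granted;
-- any rows here are backends blocked behind another transaction's lock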

> I come from an Oracle, DB2, Informix world, and in my experience
> plans for queries are more stable. i have loitered in and around MySQL
> for a while. i'm not surprised there's a learning curve with PG. i am
> surprised it breaks so markedly with mainstay database experience.

Oh, I've seen Oracle get stupid due to lack of proper statistics as
well. You likely had a good DBA who kept all that stuff hidden from
you, though. But PostgreSQL and mainline db experience are often
incompatible: the very things people are used to doing to make other
dbs fast (forcing index usage, for instance) can make postgresql
noticeably slower.

You might find that partial indexes help in some circumstances. If a
query has a where clause on a field that holds one value 99% of the
time and another value 1% of the time, you can index just that 1%, and
an index scan will be ultra quick. The standard case for that is a
boolean field.

create table test (id int, info text, btest bool);
-- populate with 100,000 rows, ~1% of them btest = true
insert into test
  select i, 'row ' || i, (i % 100 = 0)
  from generate_series(1, 100000) as i;
create index test_btest_true on test (btest) where btest is true;
analyze test;
explain analyze select * from test where btest is true;

Generally, postgresql offers different ways to solve the same problems
as other databases, and knowing those ways can really help when
troubleshooting and fixing poorly performing queries.
