Re: what is less resource-intensive, WHERE id IN or INNER JOIN?

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: Miernik <public(at)public(dot)miernik(dot)name>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: what is less resource-intensive, WHERE id IN or INNER JOIN?
Date: 2008-07-31 06:47:40
Message-ID: a1ec7d000807302347q1e1da297y382da462b723d855@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

WHERE id IN will generally lead to faster query plans. Often, much faster
on large tables.

The inner join deals with duplicate values for id differently. WHERE id IN
( subquery ) will be much more likely to choose a hash method for filtering
the scan on table1.

I just ran into this today on a query of similar nature on a table with 2M
rows (table1 below) joining on a table with 100k (table 2). The speedup was
about 10x -- 20 seconds to 2 seconds in my case -- but I'm working with many
GB of RAM and 800MB of work_mem.

I'm not an expert on the query planner guts -- the below is all knowledge
based on experimenting with queries on my dataset and triggering various
query plans. Due to the differences in semantics of these two queries the
plan on larger tables will be two sorts and a merge join for the INNER JOIN
on most columns, though indexes and especially unique indexes will change
this. This holds true even if the number of distinct values of the column
being joined on is very low.

The WHERE id IN version will produce a much faster query plan most of the
time, assuming you have enough work_mem configured. The planner is
conservative if it estimates usage of work_mem to overflow even a little bit
-- and often shifts to sorts rather than hashes on disk.

Since you are working with such a small ammount of RAM, make sure you have
some of it doled out to work_mem and tune the balance between work_mem, the
OS, and the shared buffers carefully. The conservative, sort-if-uncertain
nature of the query planner may need some coersion with an unusual
environment such as yours. You may even have faster results with a hash
overflown to disk than a sort overflown to disk with that little memory if
the % of overflow is small enough and the OS disk cache large enough. Plus,
virtual machines sometimes do some odd things with caching non sync disk
writes that may distort the usual random versus sequential disk cost for
small I/O volumes. Though my VM experience is VMWare not Xen.
The querry planner won't generally go for hashes on disk on purpose however,
so you might need to be creative with manual statistics setting or changing
the optimizer cost settings to experiment with various query plans and
measure the unique aspects of your atypical environment and your data.

On Wed, Jul 30, 2008 at 3:11 PM, Miernik <public(at)public(dot)miernik(dot)name> wrote:

> AFAIK, provided bar is UNIQUE in table2 (e.g. is a PRIMARY KEY) the two
> queries will give the same result:
>
> SELECT foo, id FROM table1 WHERE id IN (SELECT id FROM table2);
>
> SELECT foo, id FROM table1 INNER JOIN table2 USING (id);
>
> Given table1 has about 100k rows, and table2 about 100 rows, which one
> should be faster, less resource intensive, use less RAM, disk access, etc?
> Are there any other even better ways to acomlish the same query?
>
> Using 8.3.3 on a 48 MB RAM Xen.
>
> --
> Miernik
> http://miernik.name/
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Richard Huxton 2008-07-31 07:36:10 Re: how to fix problem then when two queries run at the same time, it takes longer to complete then if run in sequence
Previous Message Miernik 2008-07-31 06:15:22 Re: how to fix problem then when two queries run at the same time, it takes longer to complete then if run in sequence