Quick Links

Optimizing a huge_table/tiny_table join

From:	<kynn(at)panix(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Optimizing a huge_table/tiny_table join
Date:	2006-05-24 15:49:52
Message-ID:	200605241549.k4OFnqx05310@panix3.panix.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

I want to optimize this simple join:

SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id )

huge_table has about 2.5 million records, can be assumed as fixed, and
has the following index:

CREATE INDEX huge_table_index ON huge_table( UPPER( id ) );

...while tiny_table changes with each user request, and typically will
contain on the order of 100-1000 records. For this analysis, I put
300 records in tiny_table, resulting in 505 records in the join.

I tried several approaches. In order of increasing speed of
execution:

1. executed as shown above, with enable_seqscan on: about 100 s.

2. executed as shown above, with enable_seqscan off: about 10 s.

3. executed with a LIMIT 6000 clause added to the SELECT statement, and
enable_seqscan on: about 5 s.

4. executed with a LIMIT 600 clause added to the SELECT statement, and
enable_seqscan on: less than 1 s.

Clearly, using LIMIT is the way to go. Unfortunately I *do* want all
the records that would have been produced without the LIMIT clause,
and I don't have a formula for the limit that will guarantee this. I
could use a very large value (e.g. 20x the size of tiny_table, as in
approach 3 above) which would make the probability of hitting the
limit very small, but unfortunately, the query plan in this case is
different from the query plan when the limit is just above the
expected number of results (approach 4 above).

The query plan for the fastest approach is this:

QUERY PLAN
---------------------------------------------------------------------------------------------------------
Limit (cost=0.01..2338.75 rows=600 width=84)
-> Nested Loop (cost=0.01..14766453.89 rows=3788315 width=84)
-> Seq Scan on tiny_table t (cost=0.00..19676.00 rows=300 width=38)
-> Index Scan using huge_table_index on huge_table h (cost=0.01..48871.80 rows=12628 width=46)
Index Cond: (upper(("outer".id)::text) = upper((h.id)::text))

How can I *force* this query plan even with a higher limit value?

I found, by dumb trial and error, that in this case the switch happens
at LIMIT 5432, which, FWIW, is about 0.2% of the size of huge_table.
Is there a simpler way to determine this limit (hopefully
programmatically)?

Alternatively, I could compute the value for LIMIT as 2x the number of
records in tiny_table, and if the number of records found is *exactly*
this number, I would know that (most likely) some records were left
out. In this case, I could use the fact that, according to the query
plan above, the scan of tiny_table is sequential to infer which
records in tiny_table were disregarded when the limit was reached, and
then repeat the query with only these left over records in tiny_table.

What's your opinion of this strategy? Is there a good way to improve
it?

Many thanks in advance!

PS: FWIW, the query plan for the query with LIMIT 6000 is this:

QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=19676.75..21327.99 rows=6000 width=84)
-> Hash Join (cost=19676.75..1062244.81 rows=3788315 width=84)
Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
-> Seq Scan on huge_table h (cost=0.00..51292.43 rows=2525543 width=46)
-> Hash (cost=19676.00..19676.00 rows=300 width=38)
-> Seq Scan on tiny_table t (cost=0.00..19676.00 rows=300 width=38)

Browse pgsql-performance by date

	From	Date	Subject
Next Message	kynn	2006-05-24 17:42:46	Optimizing a huge_table/tiny_table join
Previous Message	Robin Ericsson	2006-05-24 10:27:41	Re: Selects query stats?