Re: Question with hashed IN

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question with hashed IN
Date: 2003-08-19 01:43:57
Message-ID: 18896.1061257437@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I said:
> Oh, I see what it is. The initial sizing of the hash table (number of
> buckets) is done using the planner's estimate of the number of rows out
> of the subplan. In your later examples, the hash table is woefully
> overloaded and so searching it takes longer (too many items on each
> hash chain).

> I'm not sure how important this is to work on. We could try to make the
> executor's hash code more able to adapt when the hash table grows beyond
> what it was expecting (by rehashing, etc) but personally I'd rather spend
> the time on trying to improve the estimate to begin with.

After some further looking, I realized that the backend's standard hashtable
management code (dynahash.c) already has a perfectly good algorithm for
expanding hashtables when they start to get full. But the code I'd
written for hashed aggregation and grouping didn't use dynahash.c,
because the latter didn't support anything more complex than memcmp()
for comparing keys. This was obviously a dumb decision in hindsight,
so I've invested the couple hours' work needed to improve dynahash's API
and make the hashed-aggregation code use it.

As of CVS tip I get more reasonable behavior in your example:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on fktest (cost=12510.00..12532.50 rows=500 width=4) (actual time=7803.77..7803.77 rows=0 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on pktest (cost=0.00..10010.00 rows=1000000 width=4) (actual time=248.34..5408.39 rows=1000000 loops=1)
Total runtime: 7979.45 msec
(5 rows)

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on fktest (cost=1260.00..1282.50 rows=500 width=4) (actual time=4482.05..4482.05 rows=0 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on pktest (cost=0.00..1010.00 rows=100000 width=4) (actual time=0.09..2052.05 rows=1000000 loops=1)
Total runtime: 4651.02 msec
(5 rows)

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on fktest (cost=135.00..157.50 rows=500 width=4) (actual time=5830.16..5830.16 rows=0 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on pktest (cost=0.00..110.00 rows=10000 width=4) (actual time=0.03..3413.05 rows=1000000 loops=1)
Total runtime: 5994.26 msec
(5 rows)

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on fktest (cost=22.50..45.00 rows=500 width=4) (actual time=4160.85..4160.85 rows=0 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on pktest (cost=0.00..20.00 rows=1000 width=4) (actual time=0.03..1755.32 rows=1000000 loops=1)
Total runtime: 4326.24 msec
(5 rows)

Before, the same four tests gave runtimes like these on this machine:
Total runtime: 16590.97 msec
Total runtime: 13792.45 msec
Total runtime: 22702.87 msec
Total runtime: 115465.43 msec
(I forget now whether those times were taken with a backend compiled for
profiling --- so they may not be directly comparable. But at least I
can say that the increase in runtime with smaller estimates is gone.)

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Christopher Kings-Lynne 2003-08-19 01:54:06 backwards-compat problem?
Previous Message Tom Lane 2003-08-19 01:29:13 Re: Qualified tables in error messages