Re: [SQL] "SELECT IN" Still Broken in 7.4b

From: "Shridhar Daithankar" <shridhar_daithankar(at)persistent(dot)co(dot)in>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [SQL] "SELECT IN" Still Broken in 7.4b
Date: 2003-08-22 15:11:25
Message-ID: 3F467FF5.8984.281BB3@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-sql

On 21 Aug 2003 at 16:42, Tom Lane wrote:

> Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com> writes:
> > Within the scope of the new hashed IN stuff I believe so in at least some
> > cases. I have a few million row table of integers where searching for
> > values IN (~10000 values) takes longer than creating the temp table,
> > copying into it and doing the in subquery.
>
> I did some profiling and soon realized that the main problem is the
> executor's trick for not returning the same row multiple times in a
> multiple-indexscan plan node. The point is that given
> WHERE a = 1 OR b = 1
> you could create a plan that first indexscans on a, then indexscans on
> b --- but you mustn't return any tuples in the second scan that you
> already returned in the first. IndexNext solves this by evaluating the
> prior-scan index conditions to see if they are true. Which is okay if
> there aren't too many of them. But when you have an N-element IN list
> this means you are going to do O(N^2) index expression evaluations.
> In the 10000-element IN-list test case, ExecQual gets called almost
> 50 million times :-(
>
> I'm toying with the notion of ripping out that logic and instead
> building an in-memory hashtable of already-returned TIDs. This could
> theoretically run out of memory if the multiple indexscan returns enough
> tuples, but I think in practice that wouldn't happen because the planner
> wouldn't choose an indexscan when very large numbers of tuples are being
> selected.

If I understand it correctly, we are essentially looking at ~10000 individual
index scans, in above example, isn't it? So if planner takes this in account,
it probably would not opt for seq. scan.

Other idea could be, index the in list first and search the index using
locality of values in the in list. If the 10K values in the list are between
10K-20K, they should be pretty easy to locate with single index sweep, assuming
uniform distribution. May be planner could build this IN list index before
drawing plan to determine cardinality and distribution of individual values.

On the least side, it could draw a plan for fetching a tuple range between min
and max of IN values and building in memory hash of these values for
comparison. That's would surely be cheaper than scanning entire table/result
set over ad over

Frankly, I would say a temp table is far better solution..:-)

Just a thought...

Bye
Shridhar

--
Youth doesn't excuse everything. -- Dr. Janice Lester (in Kirk's body),
"Turnabout Intruder", stardate 5928.5.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shridhar Daithankar 2003-08-22 15:17:54 Re: [HACKERS] Buglist
Previous Message Jan Wieck 2003-08-22 15:08:56 Re: [HACKERS] Buglist

Browse pgsql-sql by date

  From Date Subject
Next Message Greg Stark 2003-08-22 15:28:55 Re: [SQL] "SELECT IN" Still Broken in 7.4b
Previous Message Michele Bendazzoli 2003-08-22 14:24:27 Re: Bug on parameter bigint in PL/PGSQL