Re: slow IN() clause for many cases

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: slow IN() clause for many cases
Date: 2005-10-16 16:03:33
Message-ID: 87u0fhe9ey.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> If the fraction of records matching the IN-list is so large as to make
> that an issue, I'd think the planner would prefer a seqscan anyway.
> Besides which, it's a bit silly to worry about whether a plan is
> fast-start without taking into account the amount of planner time needed
> to create it.

Well I'm thinking about the cases where there's a short IN-list but many
records match those clauses. Something like:

WHERE user_status IN ('active','pending')

Where there could be thousands of matching records.

> Another point here is that LIMIT without any ORDER BY isn't an amazingly
> useful case. Neither the old implementation of OR indexscans nor the
> new can produce ordered output, which means you're not going to get
> fast-start behavior anyway for real queries.

That's true. That's why I was wondering more about cases where the client end
was going to read all the records until it found the record it's looking for
or found enough records for its purposes.

There are other uses of fast-start as well. If the records are going to be put
through a slow process it can be more efficient to pipeline them through while
the later records are still coming from the database.

But I guess this is all moot if the option isn't there, at least not without a
lot more programming.

The example above raises another idea though. Would it be possible for the
optimizer to recognize when a clause is so expansive that it would be faster
to read the complement than the actual clause as written?

That could be useful with bitmap operations if it meant less time reading the
index to build the bitmap but the eventual result after the bitmap operations
is restrictive enough to justify an index scan.

eg a case where 90% of users are active, and 10% have pending set and there's
an index on each of these:

WHERE user_status = 'active' AND pending = true

If the optimizer tries to a bitmap index scan now it has to read in a huge
index; it's probably better off just using the pending index alone. But if it
realizes that it can use the index to find the tuples that *aren't* active and
then take the complement of that bitmap it could build a bitmap reading in
only 10% of that index.

--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2005-10-16 16:26:40 Re: Advice needed concerning Win32 signals
Previous Message Bruce Momjian 2005-10-16 15:50:09 Re: [HACKERS] roundoff problem in time datatype