Re: SELECT ignoring index even though ORDER BY and LIMIT present

From: Jori Jovanovich <jori(at)dimensiology(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: SELECT ignoring index even though ORDER BY and LIMIT present
Date: 2010-06-03 15:32:00
Message-ID: AANLkTimEP72yYXjbH6fhm0-P8r0BbEfMZkmy7Ywjyeqy@mail.gmail.com
Lists: pgsql-performance

hi,

Sorry for not including this information in my first post.

The server is the following and is being used exclusively for this
PostgreSQL instance:

PostgreSQL 8.4.2 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.2.real (GCC)
4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit
Amazon EC2 Large Instance, 7.5GB memory, 64-bit

This is what is set in my postgresql.conf file:

max_connections = 100
ssl = true
shared_buffers = 24MB

ANALYZE VERBOSE events;
INFO: analyzing "public.events"
INFO: "events": scanned 30000 of 211312 pages, containing 1725088 live rows
and 0 dead rows; 30000 rows in sample, 12151060 estimated total rows

Updating statistics did not affect the results -- the planner still chooses a
full table scan. (I had also run ANALYZE before posting here, so this was
expected.)
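Since the sample above covered only 30000 of 211312 pages, one further thing worth trying is raising the per-column statistics target before re-analyzing, so the planner gets a better row estimate for the pattern match. This is only a sketch, not something settled in this thread: the column name is taken from the queries below, and the target value of 1000 is illustrative (8.4's default_statistics_target is 100).

```sql
-- Collect a larger sample for the column used in the WHERE clause,
-- then rebuild statistics for the table.
ALTER TABLE events ALTER COLUMN client_uuid SET STATISTICS 1000;
ANALYZE VERBOSE events;
```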

thank you

On Wed, Jun 2, 2010 at 8:49 PM, Bob Lunney <bob_lunney(at)yahoo(dot)com> wrote:

> Jori,
>
> What is the PostgreSQL
> version/shared_buffers/work_mem/effective_cache_size/default_statistics_target?
> Are the statistics for the table up to date? (Run analyze verbose
> <tablename> to update them.) Table and index structure would be nice to
> know, too.
>
> If all else fails you can set enable_seqscan = off for the session, but
> that is a Big Hammer for what is probably a smaller problem.
>
> Bob Lunney
>
> --- On *Wed, 6/2/10, Jori Jovanovich <jori(at)dimensiology(dot)com>* wrote:
>
>
> From: Jori Jovanovich <jori(at)dimensiology(dot)com>
> Subject: [PERFORM] SELECT ignoring index even though ORDER BY and LIMIT
> present
> To: pgsql-performance(at)postgresql(dot)org
> Date: Wednesday, June 2, 2010, 4:28 PM
>
>
> hi,
>
> I have a problem space where the main goal is to search backward in time
> for events. Time can go back very far into the past, and so the
> table can get quite large. However, the vast majority of queries are all
> satisfied by relatively recent data. I have an index on the row creation
> date and I would like almost all of my queries to have a query plan looking
> something like:
>
> Limit ...
> -> Index Scan Backward using server_timestamp_idx on events
> (cost=0.00..623055.91 rows=8695 width=177)
> ...
>
> However, PostgreSQL frequently tries to do a full table scan. Often what
> controls whether a scan is performed or not is dependent on the size of the
> LIMIT and how detailed the WHERE clause is. In practice, the scan is always
> the wrong answer for my use cases (where "always" is defined to be >99.9%).
>
> Some examples:
>
> (1) A sample query that devolves to a full table scan
>
> EXPLAIN
> SELECT events.id, events.client_duration, events.message,
> events.created_by, events.source, events.type, events.event,
> events.environment,
> events.server_timestamp, events.session_id, events.reference,
> events.client_uuid
> FROM events
> WHERE client_uuid ~* E'^foo bar so what'
> ORDER BY server_timestamp DESC
> LIMIT 20;
> QUERY PLAN (BAD!)
> --------------------------------------------------------------------------
> Limit (cost=363278.56..363278.61 rows=20 width=177)
> -> Sort (cost=363278.56..363278.62 rows=24 width=177)
> Sort Key: server_timestamp
> -> Seq Scan on events (cost=0.00..363278.01 rows=24 width=177)
> Filter: (client_uuid ~* '^foo bar so what'::text)
>
>
> (2) Making the query faster by making the string match LESS specific (odd,
> seems like it should be MORE)
>
> EXPLAIN
> SELECT events.id, events.client_duration, events.message,
> events.created_by, events.source, events.type, events.event,
> events.environment,
> events.server_timestamp, events.session_id, events.reference,
> events.client_uuid
> FROM events
> WHERE client_uuid ~* E'^foo'
> ORDER BY server_timestamp DESC
> LIMIT 20;
> QUERY PLAN (GOOD!)
>
>
> ------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..1433.14 rows=20 width=177)
> -> Index Scan Backward using server_timestamp_idx on events
> (cost=0.00..623055.91 rows=8695 width=177)
> Filter: (client_uuid ~* '^foo'::text)
>
>
> (3) Alternatively making the query faster by using a smaller limit
>
> EXPLAIN
> SELECT events.id, events.client_duration, events.message,
> events.created_by, events.source, events.type, events.event,
> events.environment,
> events.server_timestamp, events.session_id, events.reference,
> events.client_uuid
> FROM events
> WHERE client_uuid ~* E'^foo bar so what'
> ORDER BY server_timestamp DESC
> LIMIT 10;
> QUERY PLAN (GOOD!)
>
>
> ----------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..259606.63 rows=10 width=177)
> -> Index Scan Backward using server_timestamp_idx on events
> (cost=0.00..623055.91 rows=24 width=177)
> Filter: (client_uuid ~* '^foo bar so what'::text)
>
>
> I find myself wishing I could just put a SQL HINT on the query to force the
> index to be used but I understand that HINTs are considered harmful and are
> therefore not provided for PostgreSQL, so what is the recommended way to
> solve this?
>
> thank you very much
>
>
On Thu, Jun 3, 2010 at 5:15 AM, Matthew Wakeling <matthew(at)flymine(dot)org>
wrote:

> On Wed, 2 Jun 2010, Jori Jovanovich wrote:
>
>> (2) Making the query faster by making the string match LESS specific (odd,
>> seems like it should be MORE)
>>
>
> No, that's the way round it should be. The LIMIT changes it all. Consider
> if you have a huge table, and half of the entries match your WHERE clause.
> To fetch the ORDER BY ... LIMIT 20 using an index scan would involve
> accessing only on average 40 entries from the table referenced by the index.
> Therefore, the index is quick. However, consider a huge table that only has
> twenty matching entries. The index scan would need to touch every single row
> in the table to return the matching rows, so a sequential scan, filter, and
> sort would be much faster. Of course, if you had an index capable of
> answering the WHERE clause, that would be even better for that case.
>

Okay, this makes sense, thank you -- I was thinking about it backwards.
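For anyone finding this thread later, the two workarounds suggested above can be sketched as follows. This is an illustration only, assuming 8.4 behavior; the index name is made up, and the LIKE rewrite is only equivalent to the original ~* pattern because that pattern is an anchored literal with no regex metacharacters.

```sql
-- The Big Hammer (per Bob): disable sequential scans for one
-- transaction only; SET LOCAL reverts at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT id, server_timestamp, client_uuid
  FROM events
 WHERE client_uuid ~* E'^foo bar so what'
 ORDER BY server_timestamp DESC
 LIMIT 20;
COMMIT;

-- Alternatively (per Matthew), give the planner an index that can
-- answer the WHERE clause itself: a functional index on the lowered
-- column with text_pattern_ops, plus a LIKE rewrite of the
-- case-insensitive anchored pattern.
CREATE INDEX events_client_uuid_lower_idx
    ON events (lower(client_uuid) text_pattern_ops);

SELECT id, server_timestamp, client_uuid
  FROM events
 WHERE lower(client_uuid) LIKE 'foo bar so what%'
 ORDER BY server_timestamp DESC
 LIMIT 20;
```

The second form lets the planner pick between the prefix index and the timestamp index based on estimated selectivity, rather than being forced either way.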
