Re: Index ignored on column containing mostly 0 values

From: Andreas Kostyrka <andreas(at)kostyrka(dot)org>
To: Leif Mortenson <leiflists(at)tanukisoftware(dot)com>
Cc: Postgres Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Index ignored on column containing mostly 0 values
Date: 2006-10-31 15:03:55
Message-ID: 1162307035.18283.3.camel@andi-lap
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Am Dienstag, den 31.10.2006, 13:04 +0900 schrieb Leif Mortenson:
> Hello,
> I have been having a problem with the following query ignoring an index
> on the foos.bar column.
>
> SELECT c.id
> FROM foos c, bars r
> WHERE r.id != 0
> AND r.modified_time > '2006-10-20 10:00:00.000'
> AND r.modified_time <= '2006-10-30 15:20:00.000'
> AND c.bar = r.id
>
> The bars table contains 597 rows, while the foos table contains 5031203
> rows.
>
> After much research I figured out that the problem is being caused by the
> PG planner deciding that my foos.bar index is not useful. The data in the
> foos.bar column contains 5028698 0 values and 2505 that are ids in the bars
> table.
>
> Both tables have just been analyzed.
>
> When I EXPLAIN ANALYZE the above query, I get the following:
>
> "Hash Join (cost=3.06..201642.49 rows=25288 width=8) (actual
> time=0.234..40025.514 rows=11 loops=1)"
> " Hash Cond: ("outer".bar = "inner".id)"
> " -> Seq Scan on foos c (cost=0.00..176225.03 rows=5032303 width=16)
> (actual time=0.007..30838.623 rows=5031203 loops=1)"
> " -> Hash (cost=3.06..3.06 rows=3 width=8) (actual time=0.117..0.117
> rows=20 loops=1)"
> " -> Index Scan using bars_index_modified_time on bars r
> (cost=0.00..3.06 rows=3 width=8) (actual time=0.016..0.066 rows=20 loops=1)"
> " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
> time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
> without time zone))"
> " Filter: (id <> 0)"
> "Total runtime: 40025.629 ms"
>
> The solution I found was to change the statistics on my foos.bar column from
> the default -1 to 1000. When I do this, reanalyze the table, and rerun
> the above
> query, I get the following expected result.
>
> "Nested Loop (cost=0.00..25194.66 rows=25282 width=8) (actual
> time=13.035..23.338 rows=11 loops=1)"
> " -> Index Scan using bars_index_modified_time on bars r
> (cost=0.00..3.06 rows=3 width=8) (actual time=0.063..0.115 rows=20 loops=1)"
> " Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without
> time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp
> without time zone))"
> " Filter: (id <> 0)"
> " -> Index Scan using foos_index_bar on foos c (cost=0.00..6824.95
> rows=125780 width=16) (actual time=1.141..1.152 rows=1 loops=20)"
> " Index Cond: (c.bar = "outer".id)"
> "Total runtime: 23.446 ms"
>
> Having to do this concerns me as I am not sure what a good statistics value
> should be. Also we expect this table to grow much larger and I am concerned
> that it may not continue to function correctly. I tried a value of 100
> and that
> works when the number of bars records is small, but as soon as I increase
> them, the query starts ignoring the index again.
>
> Is increasing the statistics value the best way to resolve this problem? How
> can I best decide on a good statistics value?
>
> Having a column containing large numbers of null or 0 values seems fairly
> common. Is there way to tell Postgres to create an index of all values with
> meaning. Ie all non-0 values? None that I could find.
Have you tried

CREATE INDEX partial ON foos (bar) WHERE bar IS NOT NULL;

Andreas

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Worky Workerson 2006-10-31 20:11:00 Re: Best COPY Performance
Previous Message Richard Huxton 2006-10-31 11:47:17 Re: client crashes in PQfinish