Quick Links

Re: I'd like to learn a bit more about how indexes work

From:	Mike Christensen <mike(at)kitchenpc(dot)com>
To:	chris(at)chriscurvey(dot)com
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: I'd like to learn a bit more about how indexes work
Date:	2012-06-06 05:50:16
Message-ID:	CABs1bs3p2tE7=5wmvqNx0Zs71ufi8bKC3f1uyzovM0NYFtF8+w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Tue, Jun 5, 2012 at 7:20 PM, Chris Curvey <chris(at)chriscurvey(dot)com> wrote:
>
>
> On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike(at)kitchenpc(dot)com> wrote:
>>
>> Hi -
>>
>> I'm trying to increase my general knowledge about how indexes work in
>> databases. Though my questions are probably general and implemented
>> in a similar way across major relational DBs, I'm also curious as to
>> how they're implemented in Postgres specifically (mainly because I
>> like PG, and am always interested in knowing if PG does things in some
>> cool and interesting way).
>
>
> Quick! Create some test data!
>
> drop table if exists foobar;
> create table foobar
> ( a int not null primary key
> , b int null
> , c int null
> , d int null
> );
> create index b_idx on foobar (b);
> create index c_idx on foobar (c);
> create index d_idx on foobar (d);
>
> create or replace function generate_test_data() returns void as $$
> declare
> i integer;
> begin
> for i in 1..100000 loop
> insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
> end loop;
> end;
> $$ language plpgsql;
>
> select generate_test_data();
>
> vacuum analyze foobar;
>
>>
>>
>> I know the basics of how binary trees work, so I understand a query such
>> as :
>>
>> select * from Table where Id = 5;
>>
>> Provided Id has a btree index on it.
>
>
> sometimes. Sometimes not.
>
> explain analyze
> select * from foobar
> where a = 1234;
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------
> Index Scan using foobar_pkey on foobar (cost=0.00..8.38 rows=1 width=16)
> (actual time=0.008..0.008 rows=1 loops=1)
> Index Cond: (a = 1234)
> Total runtime: 0.030 ms
> (3 rows)
>
> explain analyze
> select *
> from foobar
> where b = 1;
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
> Seq Scan on foobar (cost=0.00..1791.00 rows=50270 width=16) (actual
> time=0.011..13.603 rows=50000 loops=1)
> Filter: (b = 1)
> Rows Removed by Filter: 50000
> Total runtime: 16.005 ms
> (4 rows)
>
>>
>> I'm curious as to how indexes
>> are used with OR and AND clauses.
>>
>> Something like:
>>
>> select * from Table where X = 5 or y = 3;
>>
>> It seems to me both the index of X would be scanned and those rows
>> would be loaded into memory, and then the index of Y would be scanned
>> and loaded. Then, Postgres would have to merge both sets into a set
>> of unique rows. Is this pretty much what's going on? Let's ignore
>> table stats for now.
>
>
> The right answer is "sometimes". Here's a query that is solved the way you
> expect:
>
> explain analyze
> select *
> from foobar
> where c = 1
> or d = 1;
> QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foobar (cost=226.47..920.65 rows=10202 width=16)
> (actual time=1.284..3.784 rows=10000 loops=1)
> Recheck Cond: ((c = 1) OR (d = 1))
> -> BitmapOr (cost=226.47..226.47 rows=10212 width=0) (actual
> time=1.174..1.174 rows=0 loops=1)
> -> Bitmap Index Scan on c_idx (cost=0.00..216.23 rows=10113
> width=0) (actual time=1.157..1.157 rows=10000 loops=1)
> Index Cond: (c = 1)
> -> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0)
> (actual time=0.016..0.016 rows=100 loops=1)
> Index Cond: (d = 1)
> Total runtime: 4.452 ms
> (8 rows)
>
> And here is a query that is not solved the way you expect, because the index
> on B is not useful.
>
> explain analyze
> select *
> from foobar
> where c = 1
> or b = 1;
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
> Seq Scan on foobar (cost=0.00..2041.00 rows=55299 width=16) (actual
> time=0.007..14.922 rows=50000 loops=1)
> Filter: ((c = 1) OR (b = 1))
> Rows Removed by Filter: 50000
> Total runtime: 17.002 ms
> (4 rows)
>
>
>>
>>
>> Then, something like:
>>
>> select * from Table where X = 5 AND y = 3;
>>
>> I would imagine the same thing is going on, only Postgres would find
>> rows that appear in both sets. I also imagine Postgres might create a
>> hash table from the larger set, and then iterate through the smaller
>> set looking for rows that were in that hash table
>
>
> and you would be largely right about that. Interestingly, on an earlier run
> of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to
> the rows after identifying all the d=1 rows.
>
> explain analyze
> select *
> from foobar
> where c = 1
> and d = 1;
> QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foobar (cost=5.13..256.48 rows=10 width=16) (actual
> time=0.046..0.150 rows=100 loops=1)
> Recheck Cond: (d = 1)
> Filter: (c = 1)
> -> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0) (actual
> time=0.026..0.026 rows=100 loops=1)
> Index Cond: (d = 1)
> Total runtime: 0.179 ms
> (6 rows)
>
>
>>
>>
>> Lastly, If you had a query such as:
>>
>> select * from Table where X IN (1,2,3,4,5,6,7);
>>
>> I would imagine Postgres would parse that query as a bunch of OR
>> clauses. Does this mean the index for X would be scanned 7 times and
>> merged into a set of unique results? Though, obviously if Postgres
>> estimated this would return the majority of the rows in the table, it
>> would probably just ignore the index completely.
>
>
> and you would be right on both counts
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3);
> QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foobar (cost=609.14..1562.18 rows=29967 width=16)
> (actual time=3.456..7.589 rows=30000 loops=1)
> Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
> -> Bitmap Index Scan on c_idx (cost=0.00..601.64 rows=29967 width=0)
> (actual time=3.342..3.342 rows=30000 loops=1)
> Index Cond: (c = ANY ('{1,2,3}'::integer[]))
> Total runtime: 9.083 ms
> (5 rows)
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3, 4, 5, 6);
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
> Seq Scan on foobar (cost=0.00..2291.00 rows=59723 width=16) (actual
> time=0.005..18.450 rows=60000 loops=1)
> Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
> Rows Removed by Filter: 40000
> Total runtime: 21.181 ms
> (4 rows)

Thanks! One thing that still confuses me is the difference between IN
and OR. With this query:

explain analyze
select *
from foobar
where d in (500, 750);

It scans the d index only once:

'Bitmap Heap Scan on foobar (cost=10.03..400.63 rows=196 width=16)
(actual time=0.128..0.489 rows=200 loops=1)'
' Recheck Cond: (d = ANY ('{500,750}'::integer[]))'
' -> Bitmap Index Scan on d_idx (cost=0.00..9.98 rows=196 width=0)
(actual time=0.086..0.086 rows=200 loops=1)'
' Index Cond: (d = ANY ('{500,750}'::integer[]))'
'Total runtime: 0.592 ms'

And if I change it to:

explain analyze
select *
from foobar
where d = 500 or d = 750;

Then the d index is scanned twice and it uses a BItmapOr:

'Bitmap Heap Scan on foobar (cost=10.10..401.18 rows=196 width=16)
(actual time=0.118..0.480 rows=200 loops=1)'
' Recheck Cond: ((d = 500) OR (d = 750))'
' -> BitmapOr (cost=10.10..10.10 rows=196 width=0) (actual
time=0.079..0.079 rows=0 loops=1)'
' -> Bitmap Index Scan on d_idx (cost=0.00..5.00 rows=98
width=0) (actual time=0.047..0.047 rows=100 loops=1)'
' Index Cond: (d = 500)'
' -> Bitmap Index Scan on d_idx (cost=0.00..5.00 rows=98
width=0) (actual time=0.030..0.030 rows=100 loops=1)'
' Index Cond: (d = 750)'
'Total runtime: 0.611 ms'

Shouldn't these two queries be the same? I'm curious as to why the OR
operator will do two Bitmap Index Scans, then OR then together - while
the IN clause will do a single Index scan that gets all 200 rows at
once. Probably the answer is just "Because that's the way it was
written, deal."

Anyway, I don't want to be too much of a vampire. I'll keep reading
stuff online as well. I found some old Tom Lane posts that are
helpful, but a lot of it is over my head still.

Mike

In response to

Re: I'd like to learn a bit more about how indexes work at 2012-06-06 02:20:25 from Chris Curvey

Responses

Re: I'd like to learn a bit more about how indexes work at 2012-06-06 14:05:55 from Tom Lane

Browse pgsql-general by date

	From	Date	Subject
Next Message	Manoj Govindassamy	2012-06-06 06:54:30	Postgres 9.1 Synchronous Replication and stuck queries during sync repl setup
Previous Message	Darren Duncan	2012-06-06 05:19:51	Re: Populate Table From Two Other Tables