Re: I'd like to learn a bit more about how indexes work

From: Chris Curvey <chris(at)chriscurvey(dot)com>
To: Mike Christensen <mike(at)kitchenpc(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: I'd like to learn a bit more about how indexes work
Date: 2012-06-06 02:20:25
Message-ID: CADfwSsATjdD_WiBq50wrvZ9SiHV4gNWVYHwayNpUtTtwkpT9WQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike(at)kitchenpc(dot)com> wrote:

> Hi -
>
> I'm trying to increase my general knowledge about how indexes work in
> databases. Though my questions are probably general and implemented
> in a similar way across major relational DBs, I'm also curious as to
> how they're implemented in Postgres specifically (mainly because I
> like PG, and am always interested in knowing if PG does things in some
> cool and interesting way).
>

Quick! Create some test data!

drop table if exists foobar;
create table foobar
( a int not null primary key
, b int null
, c int null
, d int null
);
create index b_idx on foobar (b);
create index c_idx on foobar (c);
create index d_idx on foobar (d);

create or replace function generate_test_data() returns void as $$
declare
i integer;
begin
for i in 1..100000 loop
insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
end loop;
end;
$$ language plpgsql;

select generate_test_data();

vacuum analyze foobar;

>
> I know the basics of how binary trees work, so I understand a query such
> as :
>
> select * from Table where Id = 5;
>
> Provided Id has a btree index on it.

sometimes. Sometimes not.

explain analyze
select * from foobar
where a = 1234;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
Index Scan using foobar_pkey on foobar (cost=0.00..8.38 rows=1 width=16)
(actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (a = 1234)
Total runtime: 0.030 ms
(3 rows)

explain analyze
select *
from foobar
where b = 1;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..1791.00 rows=50270 width=16) (actual
time=0.011..13.603 rows=50000 loops=1)
Filter: (b = 1)
Rows Removed by Filter: 50000
Total runtime: 16.005 ms
(4 rows)

> I'm curious as to how indexes
> are used with OR and AND clauses.
>
> Something like:
>
> select * from Table where X = 5 or y = 3;
>
> It seems to me both the index of X would be scanned and those rows
> would be loaded into memory, and then the index of Y would be scanned
> and loaded. Then, Postgres would have to merge both sets into a set
> of unique rows. Is this pretty much what's going on? Let's ignore
> table stats for now.
>

The right answer is "sometimes". Here's a query that is solved the way you
expect:

explain analyze
select *
from foobar
where c = 1
or d = 1;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=226.47..920.65 rows=10202 width=16)
(actual time=1.284..3.784 rows=10000 loops=1)
Recheck Cond: ((c = 1) OR (d = 1))
-> BitmapOr (cost=226.47..226.47 rows=10212 width=0) (actual
time=1.174..1.174 rows=0 loops=1)
-> Bitmap Index Scan on c_idx (cost=0.00..216.23 rows=10113
width=0) (actual time=1.157..1.157 rows=10000 loops=1)
Index Cond: (c = 1)
-> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0)
(actual time=0.016..0.016 rows=100 loops=1)
Index Cond: (d = 1)
Total runtime: 4.452 ms
(8 rows)

And here is a query that is not solved the way you expect, because the
index on B is not useful.

explain analyze
select *
from foobar
where c = 1
or b = 1;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2041.00 rows=55299 width=16) (actual
time=0.007..14.922 rows=50000 loops=1)
Filter: ((c = 1) OR (b = 1))
Rows Removed by Filter: 50000
Total runtime: 17.002 ms
(4 rows)

>
> Then, something like:
>
> select * from Table where X = 5 AND y = 3;
>
> I would imagine the same thing is going on, only Postgres would find
> rows that appear in both sets. I also imagine Postgres might create a
> hash table from the larger set, and then iterate through the smaller
> set looking for rows that were in that hash table

and you would be largely right about that. Interestingly, on an earlier
run of this, I got a BitmapAnd strategy, rather than applying the c=1
filter to the rows after identifying all the d=1 rows.

explain analyze
select *
from foobar
where c = 1
and d = 1;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=5.13..256.48 rows=10 width=16) (actual
time=0.046..0.150 rows=100 loops=1)
Recheck Cond: (d = 1)
Filter: (c = 1)
-> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0)
(actual time=0.026..0.026 rows=100 loops=1)
Index Cond: (d = 1)
Total runtime: 0.179 ms
(6 rows)

>
> Lastly, If you had a query such as:
>
> select * from Table where X IN (1,2,3,4,5,6,7);
>
> I would imagine Postgres would parse that query as a bunch of OR
> clauses. Does this mean the index for X would be scanned 7 times and
> merged into a set of unique results? Though, obviously if Postgres
> estimated this would return the majority of the rows in the table, it
> would probably just ignore the index completely.
>

and you would be right on both counts

explain analyze
select *
from foobar
where c in (1, 2, 3);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=609.14..1562.18 rows=29967 width=16)
(actual time=3.456..7.589 rows=30000 loops=1)
Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
-> Bitmap Index Scan on c_idx (cost=0.00..601.64 rows=29967 width=0)
(actual time=3.342..3.342 rows=30000 loops=1)
Index Cond: (c = ANY ('{1,2,3}'::integer[]))
Total runtime: 9.083 ms
(5 rows)

explain analyze
select *
from foobar
where c in (1, 2, 3, 4, 5, 6);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2291.00 rows=59723 width=16) (actual
time=0.005..18.450 rows=60000 loops=1)
Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
Rows Removed by Filter: 40000
Total runtime: 21.181 ms
(4 rows)

> Thanks!
> Mike
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2012-06-06 02:35:47 Re: problem after upgrade db missing
Previous Message Adrian Klaver 2012-06-06 02:15:21 Re: Populate Table From Two Other Tables