Re: varchar index joins not working?

From: Richard Huxton <dev(at)archonet(dot)com>
To: Adam Gundy <adam(at)starsilk(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: varchar index joins not working?
Date: 2008-04-10 08:46:09
Message-ID: 47FDD3D1.9070208@archonet.com

Adam Gundy wrote:
> I'm hitting an unexpected problem with postgres 8.3 - I have some
> tables which use varchar(32) for their unique IDs which I'm attempting
> to join using some simple SQL:
>
> select *
> from group_access, groups
> where group_access.groupid = groups.groupid and
> group_access.uid = '7275359408f44591d0717e16890ce335';
>
> there's a unique index on group_access.groupid, and a non-unique index
> on groups.groupid. both are non-null.

What about group_access.uid? I'd have thought that plus the groups
primary key is probably the sensible combination here.
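
For what it's worth, something like this (untested, column name taken
from your query) is the sort of index I mean - the groups side of the
join would then be covered by its primary key / unique index:

  -- hypothetical index on the column in your WHERE clause
  CREATE INDEX group_access_uid_idx ON group_access (uid);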

> the problem is: if groupid (in both tables) is varchar, I cannot force
> postgres (no matter how hard I try) to do an index scan. it ends up
> reading the entire groups table (pretty large!):

OK

> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=8.89..41329.88 rows=119940 width=287) (actual
> time=0.202..935.136 rows=981 loops=1)

That's because it's expecting 119,940 rows to match (rather than the
actual 981 you do get). If you were getting that many results, this
would probably be a sensible plan.

> Hash Cond: ((groups.groupid)::text = (group_access.groupid)::text)
> -> Seq Scan on groups (cost=0.00..31696.48 rows=1123348
> width=177) (actual time=0.011..446.091 rows=1125239 loops=1)

It's got a good idea of the total number of rows in groups.

> -> Hash (cost=8.51..8.51 rows=30 width=110) (actual
> time=0.148..0.148 rows=30 loops=1)
> -> Seq Scan on group_access (cost=0.00..8.51 rows=30
> width=110) (actual time=0.014..0.126 rows=30 loops=1)

And also group_access. Oh, the seq-scan doesn't really matter here. It
probably *is* faster to read all 30 rows in one burst rather than go to
the index and then back to the table.

> Filter: ((uid)::text = '7275359408f44591d0717e16890ce335'::text)
> Total runtime: 935.443 ms
> (7 rows)
>
> if I disable seq_scan, I get this:
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------
> Merge Join (cost=1.47..106189.61 rows=120004 width=287) (actual
> time=0.100..1532.353 rows=981 loops=1)

It's still thinking it's going to get 120 thousand rows.

> it's running an index scan across the entire table (no condition applied) :-(
>
> so, just for the hell of it, I tried making groupid a char(32),
> despite repeated assertions in this group that there's no performance
> difference between the two:

There's no performance difference between the two.

> Nested Loop (cost=4.48..253.85 rows=304 width=291) (actual
> time=0.715..22.906 rows=984 loops=1)

> (this last plan is actually against a smaller test DB, but I get the
> same behavior with it, seq scan for varchar or index scan for char,
> and the results returned are identical for this query)

The char(32) thing isn't important here; what is important is that it's
expecting ~300 rows rather than 120,000. It's still wrong, but it's
close enough to make sense.

So - the question is - why is PG expecting so many matches to your join?
How many distinct values do you have in groups.groupid and
group_access.groupid?
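
If you're not sure, something like this should show both what the
planner believes and what's actually in the tables (untested, and
assuming the column is called groupid in both):

  -- what the planner thinks (after an ANALYZE)
  SELECT tablename, attname, n_distinct
  FROM pg_stats
  WHERE tablename IN ('groups', 'group_access')
    AND attname = 'groupid';

  -- the actual distinct counts
  SELECT count(DISTINCT groupid) FROM groups;
  SELECT count(DISTINCT groupid) FROM group_access;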

--
Richard Huxton
Archonet Ltd
