Re: varchar index joins not working?

From: Adam Gundy <adam(at)starsilk(dot)net>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: varchar index joins not working?
Date: 2008-04-14 17:02:25
Message-ID: 48038E21.50606@starsilk.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Richard Huxton wrote:
> Adam Gundy wrote:
>> On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <adam(at)starsilk(dot)net> wrote:
>>> Richard Huxton wrote:
>>>> How many distinct values do you have in groups.groupid and
>>> group_access.group_id?
>>> for the small database (since it shows the same problem):
>>>
>>> group_access: 280/268
>>> groups: 2006/139
>>>
>>> for the large database:
>>>
>>> group_access: same
>>> groups: 1712647/140
>>>
>>> the groupid key is an MD5 hash, so it should be uniformly distributed.
>>> maybe that throws the stats? but, again, char works, varchar doesn't :-(
>>
>> OK, I'm thinking the varchar/char part is not the issue.
>
> Good, because it's not :-)

hmm. unfortunately it did turn out to be (part) of the issue. I've
discovered that mixing char and varchar in a stored procedure does not
coerce the types, and ends up doing seq scans all the time.

changing something like this:

proc x ( y char(32) )
{
select * from groups where groupid = y
}

into this:

proc x ( y varchar(32) )
{
select * from groups where groupid = y
}

and suddenly postgres does index lookups in the stored proc... way faster.

>> I *think* that when I analyze using char instead of varchar, it is
>> recording a stat for the large group, but for some reason with varchar
>> doesn't add a stat for that one.
>>
>> so, the real question is, how do I fix this? I can turn the stats way
>> up to 1000, but that doesn't guarantee that I'll get a stat for the
>> large group :-(
>
> Well, by default it will be tracking the 10 most common values (and how
> often they occur). As you say, this can be increased to 1000 (although
> it obviously takes longer to check 1000 rather than 10).
>
> We can have a look at the stats with something like:
> SELECT * FROM pg_stats WHERE tablename='group_access' AND attname='uid';
> You'll be interested in n_distinct, most_common_vals and most_common_freqs.
>
> However, I think the problem may be that PG doesn't track cross-column
> stats, so it doesn't know that a particular uid implies one or more
> particular groupid values.

I doubt we could get stats stable enough for this. the number of groups
will hopefully be much larger at some point.

it's a shame the index entries can't be used to provide information to
the planner, eg a rough count of the number of entries for a given key
(or subset). it would be nice to be able to create eg a counted btree
when you know you have this kind of data as a hint to the planner.

>> can I turn the statistics off completely for this column? I'm guessing
>> that if I can, that will mean it takes a guess based on the number of
>> distinct values in the groups table, which is still large number of
>> records, possibly enough to trigger the seqscan anyway.
>
> No - can't disable stats. Besides, you want it the other way around -
> index scans for all groups except the largest.

actually, disabling seqscan at the server level gives extremely good
response times. I ended up rewriting a few queries that were scanning
the whole group for no good reason, and bitmap index hashing seems to
take care of things nicely.

queries have gone from 30+ seconds to < 0.1 seconds.

>> does postgres have a way of building a 'counted index' that the
>> planner can use for it's record counts? some way of forcibly
>> maintaining a stat for every group?
>
> No, but let's see what's in pg_stats.

no real help there. either it hits the group being read, and does a good
plan, or it doesn't, and tries to seqscan (unless I disable it). even
forcing stats to 1000 only bandaids the situation, given the number of
groups will eventually exceed that..

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2008-04-14 17:46:06 Re: varchar index joins not working?
Previous Message Greg Smith 2008-04-14 15:44:44 Re: shared_buffers performance