Skip site navigation (1) Skip section navigation (2)

Re: varchar index joins not working?

From: Adam Gundy <adam(at)starsilk(dot)net>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: varchar index joins not working?
Date: 2008-04-14 17:02:25
Message-ID: 48038E21.50606@starsilk.net (view raw or flat)
Thread:
Lists: pgsql-performance
Richard Huxton wrote:
> Adam Gundy wrote:
>> On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <adam(at)starsilk(dot)net> wrote:
>>> Richard Huxton wrote:
>>>> How many distinct values do you have in groups.groupid and
>>> group_access.group_id?
>>>  for the small database (since it shows the same problem):
>>>
>>>  group_access: 280/268
>>>  groups: 2006/139
>>>
>>>  for the large database:
>>>
>>>  group_access: same
>>>  groups: 1712647/140
>>>
>>>  the groupid key is an MD5 hash, so it should be uniformly distributed.
>>> maybe that throws the stats? but, again, char works, varchar doesn't :-(
>>
>> OK, I'm thinking the varchar/char part is not the issue.
> 
> Good, because it's not :-)

hmm. unfortunately it did turn out to be (part) of the issue. I've 
discovered that mixing char and varchar in a stored procedure does not 
coerce the types, and ends up doing seq scans all the time.

changing something like this:

proc x ( y char(32) )
{
    select * from groups where groupid = y
}

into this:

proc x ( y varchar(32) )
{
    select * from groups where groupid = y
}

and suddenly postgres does index lookups in the stored proc... way faster.

>> I *think* that when I analyze using char instead of varchar, it is
>> recording a stat for the large group, but for some reason with varchar
>> doesn't add a stat for that one.
>>
>> so, the real question is, how do I fix this? I can turn the stats way
>> up to 1000, but that doesn't guarantee that I'll get a stat for the
>> large group :-(
> 
> Well, by default it will be tracking the 10 most common values (and how 
> often they occur). As you say, this can be increased to 1000 (although 
> it obviously takes longer to check 1000 rather than 10).
> 
> We can have a look at the stats with something like:
> SELECT * FROM pg_stats WHERE tablename='group_access' AND attname='uid';
> You'll be interested in n_distinct, most_common_vals and most_common_freqs.
> 
> However, I think the problem may be that PG doesn't track cross-column 
> stats, so it doesn't know that a particular uid implies one or more 
> particular groupid values.

I doubt we could get stats stable enough for this. the number of groups 
will hopefully be much larger at some point.

it's a shame the index entries can't be used to provide information to 
the planner, eg a rough count of the number of entries for a given key 
(or subset). it would be nice to be able to create eg a counted btree 
when you know you have this kind of data as a hint to the planner.

>> can I turn the statistics off completely for this column? I'm guessing
>> that if I can, that will mean it takes a guess based on the number of
>> distinct values in the groups table, which is still large number of
>> records, possibly enough to trigger the seqscan anyway.
> 
> No - can't disable stats. Besides, you want it the other way around - 
> index scans for all groups except the largest.

actually, disabling seqscan at the server level gives extremely good 
response times. I ended up rewriting a few queries that were scanning 
the whole group for no good reason, and bitmap index hashing seems to 
take care of things nicely.

queries have gone from 30+ seconds to < 0.1 seconds.

>> does postgres have a way of building a 'counted index' that the
>> planner can use for it's record counts? some way of forcibly
>> maintaining a stat for every group?
> 
> No, but let's see what's in pg_stats.

no real help there. either it hits the group being read, and does a good 
plan, or it doesn't, and tries to seqscan (unless I disable it). even 
forcing stats to 1000 only bandaids the situation, given the number of 
groups will eventually exceed that..

In response to

Responses

pgsql-performance by date

Next:From: Tom LaneDate: 2008-04-14 17:46:06
Subject: Re: varchar index joins not working?
Previous:From: Greg SmithDate: 2008-04-14 15:44:44
Subject: Re: shared_buffers performance

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group