Skip site navigation (1) Skip section navigation (2)

Operator Classes and ANALYZE

From: Rusty Conover <rconover(at)infogears(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Operator Classes and ANALYZE
Date: 2006-11-03 22:16:35
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-bugs
Hi All,

It doesn't appear that ANALYZE uses the specified operator class for  
producing statistics on an index when that operator class is not the  
default for the data type.  This appears to be leading to poor query  

I'm using:

# select version();
1 ]--------------------------------------------------------------------- 
version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc  
(GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)

For speed of indexing a few million urls I'm indexing them with a  
custom data type. The data type aptly named "urlhash" has:

A 2 byte host id (just the hostname, not many hosts are involved so  
this is fine)
A 2 byte locality id (parts of the path)
A 16 byte MD5 of the full url all in a custom data type called aptly.

For a total length of 20 bytes.  By default the text output of the  
data type is of the form of ([data in hex]).

Everything is nice and fast (due to good locality) for the default  
operator class that respects the entire data type.  Everything is  
unique so there is an n_distinct of -1 in pg_stats for the index.

Now, I created an operator class for urlhash that just does  
comparisons on the host id "urlhash_host_ops" and I created the index  

# create index url_uh2_idx on url using btree  (texttourlhash(url)  

After running vacuum analyze that statistics for the new index using  
the second operator class still say that every value is unique.  It  
appears it's using the default operator class to produce the  
statistics.  By doing so, the planner doesn't work great since it  
assumes that only one row will be returned.

Note: I prefixed all default operators with @ to denote the host class.

# explain select * from url where url @= 'http://'::urlhash ;
                                        QUERY PLAN
Index Scan using url_uh2_idx on url  (cost=0.00..6.02 rows=1 width=105)
    Index Cond: ((url)::urlhash @=  

# select count(*) from url where url @= 'http://'::urlhash;

There are hosts with orders of magnitude difference in total number  
of rows (100,000 or more), this was just chosen as an example.

Here are the stats for the index:

# select * from pg_stats where tablename = 'url_uh2_idx';
-[ RECORD 1 ]----- 
schemaname        | public
tablename         | url_uh2_idx
attname           | pg_expression_1
null_frac         | 0
avg_width         | 20
n_distinct        | -1
most_common_vals  |
most_common_freqs |
histogram_bounds  | {(11890a55087088877a5a01ea5b8a8459e62491c9), 
correlation       | -0.109073

Should I just create a cast function rather then an entire operator  
class that just respects the host part?  Is this bad style?  If you  
have questions I'd be happy to explain more.


Rusty Conover
InfoGears Inc.


pgsql-bugs by date

Next:From: Harald Armin MassaDate: 2006-11-04 00:17:30
Subject: Re: BUG #2731: Cannot install PostgreSQL server on WinXP Media Center Edition
Previous:From: Jeff DavisDate: 2006-11-03 20:25:22
Subject: Re: COPY fails on 8.1 with invalid byte sequences in text

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group