Skip site navigation (1) Skip section navigation (2)

Re: Comparing user attributes with bitwise operators

From: Patrick Clery <patrick(at)phpforhire(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Comparing user attributes with bitwise operators
Date: 2004-09-19 03:26:13
Message-ID: 200409182126.13686.patrick@phpforhire.com (view raw or flat)
Thread:
Lists: pgsql-performance
I have currently implemented a schema for my "Dating Site" that is storing 
user search preferences and user attributes in an int[] array using the 
contrib/intarray package (suggested by Greg Stark). But there are a few 
problems.
	a) query_int can't be cast to int4.
	b) query_int can't be indexed.

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR:  data type query_int has no default operator class for access method 
"btree"
HINT:  You must specify an operator class for the index or define a default 
operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla 
gist__int_ops);
ERROR:  operator class "gist__int_ops" does not exist for access method 
"btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

	c) query_int can only be used in one operation against int[]:

	README.intarray:
	int[] @@ query_int  - returns TRUE if array satisfies query (like '1&(2|3)')

 	It is not possible to use >=, <=, =, etc. Also, this operator does not work 
like example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
 ?column? 
----------
 f
(1 row)

	d) I can't find a way to simply check if an integer is an array without 
declaring it as an array; Therefore, I need to use an int[] type for a column 
that will only be storing one int4 if I want to compare it to an int[] array:

	README.intarray:
	int[] && int[]  - overlap - returns TRUE if arrays has at least one common 
elements.

	e) int[] and query_int are somewhat ugly to deal with since query_int needs 
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm 
just being anal :)


Because of these limitations, I've chosen to declare the attribute columns as 
int[] arrays (even though they will only contain one value) so that I can use 
'{1,2,3}'::int[] && column_name:

	README.intarray:
	int[] && int[]  - overlap - returns TRUE if arrays has at least one common 
elements.

Here is the schema:

create table people (
    person_id serial,
    datecreated timestamp with time zone default now (),
    signup_ip cidr not null,
    username character varying(30) not null,
    password character varying(28) not null,
    email character varying(65) not null,
    dob date not null,
    primary key (person_id)
);

create table people_attributes (
    person_id int references people (person_id) on delete cascade initially 
deferred,
    askmecount int not null default 0,
    age int[] not null default '{1}'::int[],
    gender int[] not null default '{1}'::int[],
    orientation int[] not null default '{1}'::int[],
    bodytype int[] not null default '{1}'::int[],
    children int[] not null default '{1}'::int[],
    drinking int[] not null default '{1}'::int[],
    education int[] not null default '{1}'::int[],
    ethnicity int[] not null default '{1}'::int[],
    eyecolor int[] not null default '{1}'::int[],
    haircolor int[] not null default '{1}'::int[],
    hairstyle int[] not null default '{1}'::int[],
    height int[] not null default '{1}'::int[],
    income int[] not null default '{1}'::int[],
    occupation int[] not null default '{1}'::int[],
    relation int[] not null default '{1}'::int[], /* multiple answer */
    religion int[] not null default '{1}'::int[],
    seeking int[] not null default '{1}'::int[], /* multiple answer */
    smoking int[] not null default '{1}'::int[],
    want_children int[] not null default '{1}'::int[],
    weight int[] not null default '{1}'::int[],

    primary key (person_id)
)
without oids;

create index people_attributes_search on people_attributes using gist (
    age gist__int_ops,
    gender gist__int_ops,
    orientation gist__int_ops,
    bodytype gist__int_ops,
    children gist__int_ops,
    drinking gist__int_ops,
    education gist__int_ops,
    ethnicity gist__int_ops,
    eyecolor gist__int_ops,
    haircolor gist__int_ops,
    hairstyle gist__int_ops,
    height gist__int_ops,
    income gist__int_ops,
    occupation gist__int_ops,
    relation gist__int_ops,
    religion gist__int_ops,
    seeking gist__int_ops,
    smoking gist__int_ops,
    want_children gist__int_ops,
    weight gist__int_ops
    );

/* These will be compared against the people_attributes table */
create table people_searchprefs (
    person_id int references people (person_id) on delete cascade initially 
deferred,
    age int[] not null default 
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
    gender int[] not null default '{1,2,4}'::int[],
    orientation int[] not null default '{1,2,8}'::int[],
    bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
    children int[] not null default '{0}'::int[],
    drinking int[] not null default '{0}'::int[],
    education int[] not null default '{0}'::int[],
    ethnicity int[] not null default '{0}'::int[],
    eyecolor int[] not null default '{0}'::int[],
    haircolor int[] not null default '{0}'::int[],
    hairstyle int[] not null default '{0}'::int[],
    height int[] not null default '{0}'::int[],
    income int[] not null default '{0}'::int[],
    occupation int[] not null default '{0}'::int[],
    relation int[] not null default '{0}'::int[],
    religion int[] not null default '{0}'::int[],
    seeking int[] not null default '{0}'::int[],
    smoking int[] not null default '{0}'::int[],
    want_children int[] not null default '{0}'::int[],
    weight int[] not null default '{0}'::int[],

    primary key (person_id)
)
without oids;


And now, the moment you've all been waiting for: performance!


(Number of profiles)

datingsite=> select count(*) from people_attributes ;
 count 
-------
 96146
(1 row)

(age, gender and sexual orientation will always be a part of the query, and 
are necessary to be invoke the index. The query is to show females, age 30-40 
of any orientation. But first, without the index)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on people_attributes  (cost=0.00..9078.56 rows=1 width=36) (actual 
time=0.044..299.537 rows=937 loops=1)
   Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND 
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 304.707 ms
(3 rows)


( with the index )

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
                                                                     QUERY 
PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using people_attributes_search on people_attributes  
(cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937 
loops=1)
   Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND 
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 57.032 ms
(3 rows)

(more realistically, it will have a limit of 10)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation limit 10;

                                                                        QUERY 
PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10 
loops=1)
   ->  Index Scan using people_attributes_search on people_attributes  
(cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) 
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 0.817 ms
(4 rows)

(slower with an sort key)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age;
                                                                        QUERY 
PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937 
loops=1)
   Sort Key: age
   ->  Index Scan using people_attributes_search on people_attributes  
(cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937 
loops=1)
         Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) 
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
 Total runtime: 71.206 ms
(5 rows)

(no better with a limit)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age limit 10;
                                                                           QUERY 
PLAN                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 
loops=1)
   ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 
rows=10 loops=1)
         Sort Key: age
         ->  Index Scan using people_attributes_search on people_attributes  
(cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 
loops=1)
               Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] 
&& age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && 
orientation))
 Total runtime: 69.899 ms
(6 rows)

The last query is the most likely since I will need to be sorting by some key. 
If I wasn't sorting it looks like it wouldn't be too bad, but sorting is 
inevitable I think.  I've only imported  96,146 of the 150,000 profiles. This 
seems a bit slow now, and it doesn't look like it will scale. 

My questions are:

- Is there a way of speeding up the sort?
- Will using queries like " WHERE orientation IN (1,2,4) " be any 
better/worse?
- The queries with the GiST index are faster, but is it of any benefit when 
the int[] arrays all contain a single value?
- Is there any hope for this structure?

Thanks for the suggestion Greg, and thanks to those who responded to this 
thread. 



On Thursday 16 September 2004 02:44, Greg Stark wrote:
> The only kind of index that is capable of indexing this type of data
> structure for arbitrary searches would be a GiST index. I'm not aware of
> any implementation for bitfields, though it would be an appropriate use.
>
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields, you would have to store an array of integer
> flags. That might be denser actually if you end up with many flags few of
> which are set.
>
> GiST indexes allow you to search arbitrary combinations of set and unset
> flags. using the "@@" operator
>
>   int[] @@ query_int  - returns TRUE if array satisfies query (like
> '1&(2|3)')
>
> You might be able to look at the code there and adapt it to apply to bit
> fields. If so I think it would be a useful tool. But GiST indexing is
> pretty esoteric stuff.



    primary key (person_id)
)
without oids;


In response to

Responses

pgsql-performance by date

Next:From: Greg StarkDate: 2004-09-19 05:07:56
Subject: Re: Comparing user attributes with bitwise operators
Previous:From: Steinar H. GundersonDate: 2004-09-18 20:03:36
Subject: Re: Planner having way wrong estimate for group aggregate

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group