Re: Comparing user attributes with bitwise operators

From: Patrick Clery <patrick(at)phpforhire(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Comparing user attributes with bitwise operators
Date: 2004-09-19 03:26:13
Message-ID: 200409182126.13686.patrick@phpforhire.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have currently implemented a schema for my "Dating Site" that is storing
user search preferences and user attributes in an int[] array using the
contrib/intarray package (suggested by Greg Stark). But there are a few
problems.
a) query_int can't be cast to int4.
b) query_int can't be indexed.

datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR: data type query_int has no default operator class for access method
"btree"
HINT: You must specify an operator class for the index or define a default
operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla
gist__int_ops);
ERROR: operator class "gist__int_ops" does not exist for access method
"btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE

c) query_int can only be used in one operation against int[]:

README.intarray:
int[] @@ query_int - returns TRUE if array satisfies query (like '1&(2|3)')

It is not possible to use >=, <=, =, etc. Also, this operator does not work
like example says:

datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
?column?
----------
f
(1 row)

d) I can't find a way to simply check if an integer is an array without
declaring it as an array; Therefore, I need to use an int[] type for a column
that will only be storing one int4 if I want to compare it to an int[] array:

README.intarray:
int[] && int[] - overlap - returns TRUE if arrays has at least one common
elements.

e) int[] and query_int are somewhat ugly to deal with since query_int needs
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm
just being anal :)

Because of these limitations, I've chosen to declare the attribute columns as
int[] arrays (even though they will only contain one value) so that I can use
'{1,2,3}'::int[] && column_name:

README.intarray:
int[] && int[] - overlap - returns TRUE if arrays has at least one common
elements.

Here is the schema:

create table people (
person_id serial,
datecreated timestamp with time zone default now (),
signup_ip cidr not null,
username character varying(30) not null,
password character varying(28) not null,
email character varying(65) not null,
dob date not null,
primary key (person_id)
);

create table people_attributes (
person_id int references people (person_id) on delete cascade initially
deferred,
askmecount int not null default 0,
age int[] not null default '{1}'::int[],
gender int[] not null default '{1}'::int[],
orientation int[] not null default '{1}'::int[],
bodytype int[] not null default '{1}'::int[],
children int[] not null default '{1}'::int[],
drinking int[] not null default '{1}'::int[],
education int[] not null default '{1}'::int[],
ethnicity int[] not null default '{1}'::int[],
eyecolor int[] not null default '{1}'::int[],
haircolor int[] not null default '{1}'::int[],
hairstyle int[] not null default '{1}'::int[],
height int[] not null default '{1}'::int[],
income int[] not null default '{1}'::int[],
occupation int[] not null default '{1}'::int[],
relation int[] not null default '{1}'::int[], /* multiple answer */
religion int[] not null default '{1}'::int[],
seeking int[] not null default '{1}'::int[], /* multiple answer */
smoking int[] not null default '{1}'::int[],
want_children int[] not null default '{1}'::int[],
weight int[] not null default '{1}'::int[],

primary key (person_id)
)
without oids;

create index people_attributes_search on people_attributes using gist (
age gist__int_ops,
gender gist__int_ops,
orientation gist__int_ops,
bodytype gist__int_ops,
children gist__int_ops,
drinking gist__int_ops,
education gist__int_ops,
ethnicity gist__int_ops,
eyecolor gist__int_ops,
haircolor gist__int_ops,
hairstyle gist__int_ops,
height gist__int_ops,
income gist__int_ops,
occupation gist__int_ops,
relation gist__int_ops,
religion gist__int_ops,
seeking gist__int_ops,
smoking gist__int_ops,
want_children gist__int_ops,
weight gist__int_ops
);

/* These will be compared against the people_attributes table */
create table people_searchprefs (
person_id int references people (person_id) on delete cascade initially
deferred,
age int[] not null default
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
gender int[] not null default '{1,2,4}'::int[],
orientation int[] not null default '{1,2,8}'::int[],
bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
children int[] not null default '{0}'::int[],
drinking int[] not null default '{0}'::int[],
education int[] not null default '{0}'::int[],
ethnicity int[] not null default '{0}'::int[],
eyecolor int[] not null default '{0}'::int[],
haircolor int[] not null default '{0}'::int[],
hairstyle int[] not null default '{0}'::int[],
height int[] not null default '{0}'::int[],
income int[] not null default '{0}'::int[],
occupation int[] not null default '{0}'::int[],
relation int[] not null default '{0}'::int[],
religion int[] not null default '{0}'::int[],
seeking int[] not null default '{0}'::int[],
smoking int[] not null default '{0}'::int[],
want_children int[] not null default '{0}'::int[],
weight int[] not null default '{0}'::int[],

primary key (person_id)
)
without oids;

And now, the moment you've all been waiting for: performance!

(Number of profiles)

datingsite=> select count(*) from people_attributes ;
count
-------
96146
(1 row)

(age, gender and sexual orientation will always be a part of the query, and
are necessary to be invoke the index. The query is to show females, age 30-40
of any orientation. But first, without the index)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on people_attributes (cost=0.00..9078.56 rows=1 width=36) (actual
time=0.044..299.537 rows=937 loops=1)
Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 304.707 ms
(3 rows)

( with the index )

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 57.032 ms
(3 rows)

(more realistically, it will have a limit of 10)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation limit 10;

QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10
loops=1)
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 0.817 ms
(4 rows)

(slower with an sort key)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937
loops=1)
Sort Key: age
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 71.206 ms
(5 rows)

(no better with a limit)

explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age limit 10;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10
loops=1)
-> Sort (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418
rows=10 loops=1)
Sort Key: age
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[]
&& age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] &&
orientation))
Total runtime: 69.899 ms
(6 rows)

The last query is the most likely since I will need to be sorting by some key.
If I wasn't sorting it looks like it wouldn't be too bad, but sorting is
inevitable I think. I've only imported 96,146 of the 150,000 profiles. This
seems a bit slow now, and it doesn't look like it will scale.

My questions are:

- Is there a way of speeding up the sort?
- Will using queries like " WHERE orientation IN (1,2,4) " be any
better/worse?
- The queries with the GiST index are faster, but is it of any benefit when
the int[] arrays all contain a single value?
- Is there any hope for this structure?

Thanks for the suggestion Greg, and thanks to those who responded to this
thread.

On Thursday 16 September 2004 02:44, Greg Stark wrote:
> The only kind of index that is capable of indexing this type of data
> structure for arbitrary searches would be a GiST index. I'm not aware of
> any implementation for bitfields, though it would be an appropriate use.
>
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields, you would have to store an array of integer
> flags. That might be denser actually if you end up with many flags few of
> which are set.
>
> GiST indexes allow you to search arbitrary combinations of set and unset
> flags. using the "@@" operator
>
> int[] @@ query_int - returns TRUE if array satisfies query (like
> '1&(2|3)')
>
> You might be able to look at the code there and adapt it to apply to bit
> fields. If so I think it would be a useful tool. But GiST indexing is
> pretty esoteric stuff.

primary key (person_id)
)
without oids;

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Stark 2004-09-19 05:07:56 Re: Comparing user attributes with bitwise operators
Previous Message Steinar H. Gunderson 2004-09-18 20:03:36 Re: Planner having way wrong estimate for group aggregate