From: | Patrick Clery <patrick(at)phpforhire(dot)com> |
---|---|
To: | Greg Stark <gsstark(at)mit(dot)edu> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Comparing user attributes with bitwise operators |
Date: | 2004-09-19 03:26:13 |
Message-ID: | 200409182126.13686.patrick@phpforhire.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
I have currently implemented a schema for my "Dating Site" that is storing
user search preferences and user attributes in an int[] array using the
contrib/intarray package (suggested by Greg Stark). But there are a few
problems.
a) query_int can't be cast to int4.
b) query_int can't be indexed.
datingsite=> alter table people_attributes add column bla query_int;
ALTER TABLE
datingsite=> create index idx_query_int on people_attributes (bla);
ERROR: data type query_int has no default operator class for access method
"btree"
HINT: You must specify an operator class for the index or define a default
operator class for the data type.
datingsite=> create index idx_query_int on people_attributes (bla
gist__int_ops);
ERROR: operator class "gist__int_ops" does not exist for access method
"btree"
datingsite=> alter table people_attributes drop column bla;
ALTER TABLE
c) query_int can only be used in one operation against int[]:
README.intarray:
int[] @@ query_int - returns TRUE if array satisfies query (like '1&(2|3)')
It is not possible to use >=, <=, =, etc. Also, this operator does not work
like example says:
datingsite=> select '{2,3}'::int[] @@ '1'::query_int;
?column?
----------
f
(1 row)
d) I can't find a way to simply check if an integer is an array without
declaring it as an array; Therefore, I need to use an int[] type for a column
that will only be storing one int4 if I want to compare it to an int[] array:
README.intarray:
int[] && int[] - overlap - returns TRUE if arrays has at least one common
elements.
e) int[] and query_int are somewhat ugly to deal with since query_int needs
to be quoted as a string, and int[] is returned as '{1,2,3}'. Or maybe I'm
just being anal :)
Because of these limitations, I've chosen to declare the attribute columns as
int[] arrays (even though they will only contain one value) so that I can use
'{1,2,3}'::int[] && column_name:
README.intarray:
int[] && int[] - overlap - returns TRUE if arrays has at least one common
elements.
Here is the schema:
create table people (
person_id serial,
datecreated timestamp with time zone default now (),
signup_ip cidr not null,
username character varying(30) not null,
password character varying(28) not null,
email character varying(65) not null,
dob date not null,
primary key (person_id)
);
create table people_attributes (
person_id int references people (person_id) on delete cascade initially
deferred,
askmecount int not null default 0,
age int[] not null default '{1}'::int[],
gender int[] not null default '{1}'::int[],
orientation int[] not null default '{1}'::int[],
bodytype int[] not null default '{1}'::int[],
children int[] not null default '{1}'::int[],
drinking int[] not null default '{1}'::int[],
education int[] not null default '{1}'::int[],
ethnicity int[] not null default '{1}'::int[],
eyecolor int[] not null default '{1}'::int[],
haircolor int[] not null default '{1}'::int[],
hairstyle int[] not null default '{1}'::int[],
height int[] not null default '{1}'::int[],
income int[] not null default '{1}'::int[],
occupation int[] not null default '{1}'::int[],
relation int[] not null default '{1}'::int[], /* multiple answer */
religion int[] not null default '{1}'::int[],
seeking int[] not null default '{1}'::int[], /* multiple answer */
smoking int[] not null default '{1}'::int[],
want_children int[] not null default '{1}'::int[],
weight int[] not null default '{1}'::int[],
primary key (person_id)
)
without oids;
create index people_attributes_search on people_attributes using gist (
age gist__int_ops,
gender gist__int_ops,
orientation gist__int_ops,
bodytype gist__int_ops,
children gist__int_ops,
drinking gist__int_ops,
education gist__int_ops,
ethnicity gist__int_ops,
eyecolor gist__int_ops,
haircolor gist__int_ops,
hairstyle gist__int_ops,
height gist__int_ops,
income gist__int_ops,
occupation gist__int_ops,
relation gist__int_ops,
religion gist__int_ops,
seeking gist__int_ops,
smoking gist__int_ops,
want_children gist__int_ops,
weight gist__int_ops
);
/* These will be compared against the people_attributes table */
create table people_searchprefs (
person_id int references people (person_id) on delete cascade initially
deferred,
age int[] not null default
'{18,19,20,21,22,23,24,25,26,27,28,29,30}'::int[],
gender int[] not null default '{1,2,4}'::int[],
orientation int[] not null default '{1,2,8}'::int[],
bodytype int[] not null default '{1,2,3,4,5,6}'::int[],
children int[] not null default '{0}'::int[],
drinking int[] not null default '{0}'::int[],
education int[] not null default '{0}'::int[],
ethnicity int[] not null default '{0}'::int[],
eyecolor int[] not null default '{0}'::int[],
haircolor int[] not null default '{0}'::int[],
hairstyle int[] not null default '{0}'::int[],
height int[] not null default '{0}'::int[],
income int[] not null default '{0}'::int[],
occupation int[] not null default '{0}'::int[],
relation int[] not null default '{0}'::int[],
religion int[] not null default '{0}'::int[],
seeking int[] not null default '{0}'::int[],
smoking int[] not null default '{0}'::int[],
want_children int[] not null default '{0}'::int[],
weight int[] not null default '{0}'::int[],
primary key (person_id)
)
without oids;
And now, the moment you've all been waiting for: performance!
(Number of profiles)
datingsite=> select count(*) from people_attributes ;
count
-------
96146
(1 row)
(age, gender and sexual orientation will always be a part of the query, and
are necessary to be invoke the index. The query is to show females, age 30-40
of any orientation. But first, without the index)
explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on people_attributes (cost=0.00..9078.56 rows=1 width=36) (actual
time=0.044..299.537 rows=937 loops=1)
Filter: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 304.707 ms
(3 rows)
( with the index )
explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.064..52.383 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND
('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 57.032 ms
(3 rows)
(more realistically, it will have a limit of 10)
explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation limit 10;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..6.02 rows=1 width=36) (actual time=0.235..0.651 rows=10
loops=1)
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=36) (actual time=0.224..0.550 rows=10 loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 0.817 ms
(4 rows)
(slower with an sort key)
explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=6.03..6.03 rows=1 width=68) (actual time=62.572..66.338 rows=937
loops=1)
Sort Key: age
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.223..55.999 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age)
AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
Total runtime: 71.206 ms
(5 rows)
(no better with a limit)
explain analyze
select person_id, gender
from people_attributes
where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
and '{2}'::int[] && gender
and '{1,2,4}'::int[] && orientation order by age limit 10;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10
loops=1)
-> Sort (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418
rows=10 loops=1)
Sort Key: age
-> Index Scan using people_attributes_search on people_attributes
(cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937
loops=1)
Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[]
&& age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] &&
orientation))
Total runtime: 69.899 ms
(6 rows)
The last query is the most likely since I will need to be sorting by some key.
If I wasn't sorting it looks like it wouldn't be too bad, but sorting is
inevitable I think. I've only imported 96,146 of the 150,000 profiles. This
seems a bit slow now, and it doesn't look like it will scale.
My questions are:
- Is there a way of speeding up the sort?
- Will using queries like " WHERE orientation IN (1,2,4) " be any
better/worse?
- The queries with the GiST index are faster, but is it of any benefit when
the int[] arrays all contain a single value?
- Is there any hope for this structure?
Thanks for the suggestion Greg, and thanks to those who responded to this
thread.
On Thursday 16 September 2004 02:44, Greg Stark wrote:
> The only kind of index that is capable of indexing this type of data
> structure for arbitrary searches would be a GiST index. I'm not aware of
> any implementation for bitfields, though it would be an appropriate use.
>
> What there is now is the contrib/intarray package. You would have to store
> more than just the bitfields, you would have to store an array of integer
> flags. That might be denser actually if you end up with many flags few of
> which are set.
>
> GiST indexes allow you to search arbitrary combinations of set and unset
> flags. using the "@@" operator
>
> int[] @@ query_int - returns TRUE if array satisfies query (like
> '1&(2|3)')
>
> You might be able to look at the code there and adapt it to apply to bit
> fields. If so I think it would be a useful tool. But GiST indexing is
> pretty esoteric stuff.
primary key (person_id)
)
without oids;
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Stark | 2004-09-19 05:07:56 | Re: Comparing user attributes with bitwise operators |
Previous Message | Steinar H. Gunderson | 2004-09-18 20:03:36 | Re: Planner having way wrong estimate for group aggregate |