Re: Comparing user attributes with bitwise operators

From: Patrick Clery <patrick(at)phpforhire(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Comparing user attributes with bitwise operators
Date: 2004-10-05 06:39:23
Message-ID: 200410050039.24230.patrick@phpforhire.com
Lists: pgsql-performance

Sorry I have taken this long to reply, Greg, but here are the results of the
personals site done with contrib/intarray:

The first thing I did was add a serial column to the attributes table. So
instead of having a unique constraint on (attribute_id,value_id), every row
has a unique value:

datingsite=> \d attribute_names
                                  Table "public.attribute_names"
     Column     |         Type          |                         Modifiers
----------------+-----------------------+----------------------------------------------------------------------------
 attribute_id   | integer               | not null default nextval('public.attribute_names_attribute_id_seq'::text)
 attribute_name | character varying(50) | not null
Indexes:
    "attribute_names_pkey" PRIMARY KEY, btree (attribute_id)
    "attribute_names_attribute_id_key" UNIQUE, btree (attribute_id, attribute_name)

an example insert:
insert into attribute_names (attribute_name) values ('languages');

datingsite=> \d attribute_values
                                 Table "public.attribute_values"
    Column    |          Type          |                        Modifiers
--------------+------------------------+-------------------------------------------------------------------------
 attribute_id | integer                | not null
 order_id     | integer                | not null default (nextval('order_id_seq'::text) - 1)
 label        | character varying(255) | not null
 value_id     | integer                | not null default nextval('public.attribute_values_value_id_seq'::text)
Indexes:
    "attribute_values_pkey" PRIMARY KEY, btree (value_id)
Foreign-key constraints:
    "attribute_values_attribute_id_fkey" FOREIGN KEY (attribute_id) REFERENCES attribute_names(attribute_id)

an example insert (22 is the attribute_id of "languages"):
insert into attribute_values (attribute_id, label) values (22, 'English');

The "value_id" column is where the integers inside the int[] arrays will
reference. Even age (between 18-99) and height (between 48-84) have rows for
every possible choice, as well as "Ask me!" where a user could choose to
leave that blank.
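
Populating those rows is easy enough. Here's a sketch of how the age rows
could be generated, assuming generate_series() is available (8.0+) and that
"age" happens to be attribute_id 5 (a hypothetical value):

-- one attribute_values row per possible age, 18 through 99
insert into attribute_values (attribute_id, label)
select 5, i::text from generate_series(18, 99) as i;

-- plus the "Ask me!" row for users who leave it blank
insert into attribute_values (attribute_id, label) values (5, 'Ask me!');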

Here is "the int[] table":

create table people_attributes (
    person_id     int references people (person_id) on delete cascade initially deferred,
    askmecount    int not null default 0,
    age           int not null references attribute_values (value_id) on delete restrict,
    gender        int not null references attribute_values (value_id) on delete restrict,
    bodytype      int not null references attribute_values (value_id) on delete restrict,
    children      int not null references attribute_values (value_id) on delete restrict,
    drinking      int not null references attribute_values (value_id) on delete restrict,
    education     int not null references attribute_values (value_id) on delete restrict,
    ethnicity     int not null references attribute_values (value_id) on delete restrict,
    eyecolor      int not null references attribute_values (value_id) on delete restrict,
    haircolor     int not null references attribute_values (value_id) on delete restrict,
    hairstyle     int not null references attribute_values (value_id) on delete restrict,
    height        int not null references attribute_values (value_id) on delete restrict,
    income        int not null references attribute_values (value_id) on delete restrict,
    languages     int[] not null,
    occupation    int not null references attribute_values (value_id) on delete restrict,
    orientation   int not null references attribute_values (value_id) on delete restrict,
    relation      int not null references attribute_values (value_id) on delete restrict,
    religion      int not null references attribute_values (value_id) on delete restrict,
    smoking       int not null references attribute_values (value_id) on delete restrict,
    want_children int not null references attribute_values (value_id) on delete restrict,
    weight        int not null references attribute_values (value_id) on delete restrict,

    seeking int[] not null,

    primary key (person_id)
)
without oids;
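
An example insert into this table (every value_id here is hypothetical,
since the real ones depend entirely on the order the attribute_values rows
were created):

insert into people_attributes
    (person_id, age, gender, bodytype, children, drinking, education,
     ethnicity, eyecolor, haircolor, hairstyle, height, income, languages,
     occupation, orientation, relation, religion, smoking, want_children,
     weight, seeking)
values
    -- all value_ids below are made up for illustration
    (1, 99, 89, 5, 10, 2, 15, 20, 25, 30, 35, 120, 40, '{13,14}',
     45, 92, 48, 50, 67, 55, 60, '{88,92}');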

You'll notice that "seeking" and "languages" are both int[] types; I did
this because those are multiple choice. The index was created like so:

create index people_attributes_search on people_attributes using gist (
(array[
age,
gender,
orientation,
children,
drinking,
education,
ethnicity,
eyecolor,
haircolor,
hairstyle,
height,
income,
occupation,
relation,
religion,
smoking,
want_children,
weight
] + seeking + languages) gist__int_ops
);

seeking and languages are appended with the intarray + op.
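(For anyone unfamiliar with intarray: + on two int[] values concatenates
them, and int[] + int appends a single element. A quick sketch:

select '{1,2}'::int[] + '{3,4}'::int[];  -- {1,2,3,4}
select '{1,2}'::int[] + 3;               -- {1,2,3}
)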

I'm not going to go into too much depth on how this query was generated,
since that was mostly done on the PHP side of things, but the structure
should make it obvious. I did, however, have to generate a few SQL functions
using Smarty templates, since it would be way too tedious to map all these
values by hand.
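
To give a rough idea, one of those generated functions might look something
like this (hypothetical -- the real ones were spit out by Smarty, but the
shape is about right): a lookup that maps an attribute name and label to its
value_id:

create function attr_value_id (varchar, varchar) returns int as '
    -- hypothetical name/signature: (attribute_name, label) -> value_id
    select av.value_id
    from attribute_values av
    join attribute_names an on an.attribute_id = av.attribute_id
    where an.attribute_name = $1
      and av.label = $2
' language sql stable;

-- e.g. select attr_value_id('gender', 'Female');  -- would return, say, 89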

There are 96,000 rows (people) in the people_attributes table. Here is what is
going on in the following query: "Show me all single (48) females (89) who
are heterosexual (92) and aged between 18 and 31 (95|96|97|98|99|100|101|102|
103|104|105|106|107|108)"

EXPLAIN ANALYZE SELECT *
FROM people_attributes pa
WHERE person_id <> 1
AND (ARRAY[age, gender, orientation, children, drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, want_children, weight] + seeking + languages) @@
'48 & 89 & 92 & ( 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105
| 106 | 107 | 108 )'

Index Scan using people_attributes_search on people_attributes pa
(cost=0.00..386.45 rows=96 width=140) (actual time=0.057..19.266 rows=516
loops=1)
Index Cond: (((ARRAY[age, gender, orientation, children, drinking,
education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
occupation, relation, religion, smoking, want_children, weight] + seeking) +
languages) @@ '48 & 89 & 92 & ( ( ( ( ( ( ( ( ( ( ( ( ( 95 | 96 ) | 97 ) |
98 ) | 99 ) | 100 ) | 101 ) | 102 ) | 103 ) | 104 ) | 105 ) | 106 ) | 107 ) |
108 )'::query_int)
Filter: (person_id <> 1)
Total runtime: 21.646 ms

The speed only seems to vary significantly on very broad searches, e.g. "all
females." But once the threshold of about two or more attributes is met, the
times are very acceptable.

If we get a little more specific by adding "non-smokers and non-drinkers
between 18 and 22", there are slight improvements:

EXPLAIN ANALYZE SELECT *
FROM people_attributes pa
WHERE person_id <> 1
AND (ARRAY[age, gender, orientation, children, drinking, education,
ethnicity, eyecolor, haircolor, hairstyle, height, income, occupation,
relation, religion, smoking, want_children, weight] + seeking + languages) @@
'48 & 89 & 92 & ( 95 | 96 | 97 | 98) & 67 & 2'

Index Scan using people_attributes_search on people_attributes pa
(cost=0.00..386.45 rows=96 width=140) (actual time=0.077..13.090 rows=32
loops=1)
Index Cond: (((ARRAY[age, gender, orientation, children, drinking,
education, ethnicity, eyecolor, haircolor, hairstyle, height, income,
occupation, relation, religion, smoking, want_children, weight] + seeking) +
languages) @@ '48 & 89 & 92 & ( ( ( 95 | 96 ) | 97 ) | 98 ) & 67 &
2'::query_int)
Filter: (person_id <> 1)
Total runtime: 13.393 ms

All in all, my final thoughts on this are that it is "hella faster" than the
previous methods. Vertical tables for your user attributes will not work for
a personals site -- there are just too many conditions to be able to use an
index efficiently. Of all the methods I have tried, the vertical table was
not even remotely scalable on large amounts of data. A horizontal table is
the way to go, but it wouldn't perform like this if not for the intarray
module. The array method works quite nicely, especially for columns like
"languages" and "seeking" that are multiple choice. However, even though this
method is fast, I still might opt to cache the results, because the "real
world" search query involves a lot more and will be executed non-stop. But
having it run this fast the first time certainly helps.

The only drawback I can think of is that the attributes no longer have values
like 1, 2, 3 -- instead they can be any integer value. This puts a spin on
the programming side of things, and required me to write "code that writes
code" on a few occasions during the attribute "mapping" process -- for
example, to keep an associative array of all the attributes without fetching
that data from the database each time. My advice: if you're not a masochist,
use a template engine (or simply parse out a print_r()) to create these PHP
arrays or SQL functions.
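
The query feeding that print_r() (or the template) is simple enough --
something like:

-- pull the whole attribute map in one go; the PHP side turns this into a
-- nested array (attribute_name => label => value_id)
select an.attribute_name, av.label, av.value_id
from attribute_names an
join attribute_values av on av.attribute_id = an.attribute_id
order by an.attribute_name, av.order_id;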

Greg, thanks a lot for the advice. I owe you a beer ;)

On Saturday 18 September 2004 23:07, you wrote:
> Patrick Clery <patrick(at)phpforhire(dot)com> writes:
> > PLAN
> > ----------------------------------------------------------------------------------------------------------------------------------------
> >  Limit  (cost=6.03..6.03 rows=1 width=68) (actual time=69.391..69.504 rows=10 loops=1)
> >    ->  Sort  (cost=6.03..6.03 rows=1 width=68) (actual time=69.381..69.418 rows=10 loops=1)
> >          Sort Key: age
> >          ->  Index Scan using people_attributes_search on people_attributes  (cost=0.00..6.02 rows=1 width=68) (actual time=0.068..61.648 rows=937 loops=1)
> >                Index Cond: (('{30,31,32,33,34,35,36,37,38,39,40}'::integer[] && age) AND ('{2}'::integer[] && gender) AND ('{1,2,4}'::integer[] && orientation))
> >  Total runtime: 69.899 ms
> > (6 rows)
>
> ...
>
> > - Is there a way of speeding up the sort?
>
> The sort seems to have only taken 8ms out of 69ms or just over 10%. As long
> as the index scan doesn't match too many records the sort should never be
> any slower so it shouldn't be the performance bottleneck. You might
> consider putting a subquery inside the order by with a limit to ensure that
> the sort never gets more than some safe maximum. Something like:
>
> select * from (select * from people_attributes where ... limit 1000) as x
> order by age limit 10
>
> This means if the query matches more than 1000 it won't be sorted properly
> by age; you'll get the top 10 out of some random subset. But you're
> protected against ever having to sort more than 1000 records.
>
> > - Will using queries like " WHERE orientation IN (1,2,4) " be any
> > better/worse?
>
> Well they won't use the GiST index, so no. If there was a single column
> with a btree index then this would be the cleanest way to go.
>
> > - The queries with the GiST index are faster, but is it of any benefit
> > when the int[] arrays all contain a single value?
>
> Well you've gone from 5 minutes to 60ms. You'll have to do more than one
> test to be sure but it sure seems like it's of some benefit.
>
> If they're always a single value you could make it an expression index
> instead and not have to change your data model.
>
> Just have the fields be integers individually and make an index as:
>
> create index idx on people_attributes using gist (
> (array[age]) gist__int_ops,
> (array[gender]) gist__int_ops,
> ...
> )
>
>
> However I would go one step further. I would make the index simply:
>
> create index idx on people_attributes using gist (
> (array[age,gender,orientation,...]) gist__int_ops
> )
>
> And ensure that all of these attributes have distinct domains. Ie, that
> they don't share any values. There are 4 billion integer values available
> so that shouldn't be an issue.
>
> Then you could use query_int to compare them the way you want. You
> misunderstood how query_int is supposed to work. You index an array column
> and then you can check it against a query_int just as you're currently
> checking for overlap. Think of @@ as a more powerful version of the overlap
> operator that can do complex logical expressions.
>
> The equivalent of
>
> where '{30,31,32,33,34,35,36,37,38,39,40}'::int[] && age
> and '{2}'::int[] && gender
> and '{1,2,4}'::int[] && orientation
>
> would then become:
>
> WHERE array[age,gender,orientation] @@
> '(30|31|32|33|34|35|36|37|38|39|40)&(2)&(1|2|4)'
>
> except you would have to change orientation and gender to not both have a
> value of 2.
>
> Actually, you might consider the expression index a bit of overkill. You
> might instead just store a column "attributes" with an integer array
> directly in the table.
>
> You would also want a table that lists the valid attributes to be sure not
> to have any overlaps:
>
> 1 age 1
> 2 age 2
> ...
> 101 gender male
> 102 gender female
> 103 orientation straight
> 104 orientation gay
> 105 orientation bi
> 106 bodytype scrawny
> ...
>
> > - Is there any hope for this structure?
>
> You'll have to test this carefully. I tried using GiST indexes for my
> project and found that I couldn't load the data and build the GiST indexes
> fast enough. You have to test the costs of building and maintaining this
> index, especially since it has so many columns in it.
>
> But it looks like your queries are in trouble without it so hopefully it'll
> be ok on the insert/update side for you.
