Comparing user attributes with bitwise operators

From: Patrick Clery <patrick(at)phpforhire(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Comparing user attributes with bitwise operators
Date: 2004-09-16 07:41:37
Message-ID: 200409160141.37604.patrick@phpforhire.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I'm working on a dating/personals/match-making site, that has used many
different methods of "match-making", that all seem to be very slow. One I am
attempting now that seems to be an efficient method of storage, but not the
best for indexing, is using bitwise operators to compare one person's profile
to another's.

This method seems to be very fast on a small scale, but I am dealing with a
large user-base here, in excess of 200,000 users that will be executing this
search function every time they login (the search results of their profile
will appear on the main page after they have logged in). I've opted to use
"label tables" for each possible set of answers. (i.e: Marital Status)

For this table, the set of bits -- bit(5) -- are represented as such:

+-----+------------+
| Bit | Meaning |
+-----+------------+
| 1 | single |
| 2 | separated |
| 3 | married |
| 4 | divorced |
| 5 | widowed |
+-----+------------+

Here's the structure of the marital status table:

# \d marital_matrix
Table "public.marital_matrix"
Column | Type | Modifiers
-----------+----------------+-----------------------------------------------------------------------
member_id | integer | not null default
nextval('public.marital_matrix_member_id_seq'::text)
status | bit varying(5) | not null default (0)::bit(5)
p_status | bit varying(5) | not null default (0)::bit(5)
Indexes:
"marital_matrix_pkey" PRIMARY KEY, btree (member_id)
"idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
"idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
"idx_marital_p_status" btree (p_status)
"idx_marital_status" btree (status)
Foreign-key constraints:
"$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
CASCADE DEFERRABLE INITIALLY DEFERRED

To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
smaller sample than what I will actually be using):

datingsite=> select count(*),status,p_status from marital_matrix group by
status,p_status;
count | status | p_status
-------+--------+----------
89 | 00001 | 00000
1319 | 00010 | 00000
2465 | 00100 | 00000
1 | 00100 | 11111
46117 | 10000 | 00000

here is the user I'll be comparing against, which has selected that he be
matched with any but married people:

datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
member_id | status | p_status
-----------+--------+----------
21 | 10000 | 11011
(1 row)

Here are a few possible comparison methods I can think of (NOTE: tests were
run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:

METHOD 1: Any bit that is set in p_status (prefered marital status) of the
searching user should be set in the potential match's marital status. This is
the method I'd like to improve, if possible. Running the query twice didn't
produce a different runtime.

EXPLAIN ANALYZE
SELECT
m2.member_id
FROM
marital_matrix m1, marital_matrix m2
WHERE
m1.member_id = 21 AND
m2.status & m1.p_status != B'00000';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..2357.79 rows=49742 width=4) (actual
time=18.062..708.469 rows=47525 loops=1)
Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
B'00000'::"bit")
-> Index Scan using marital_matrix_pkey on marital_matrix m1
(cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
Index Cond: (member_id = 21)
-> Seq Scan on marital_matrix m2 (cost=0.00..1602.91 rows=49991 width=13)
(actual time=17.966..255.529 rows=49991 loops=1)
Total runtime: 905.694 ms
(6 rows)

METHOD 2: Specifying the value (I don't think this would make a difference,
but I'll post anyways):

EXPLAIN ANALYZE
SELECT
member_id
FROM
marital_matrix
WHERE
status & B'11011' != B'00000';

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on marital_matrix (cost=0.00..1852.87 rows=49742 width=4) (actual
time=18.113..281.101 rows=47525 loops=1)
Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
Total runtime: 480.836 ms
(3 rows)

METHOD 3: Checking for one bit only. This is definitely not a "real world"
example and unacceptable since the p_status column can and will have multiple
bits. For categories other than "Marital Status", such as "Prefered Hair
Color", the users are likely to select multiple bits (they choose all that
apply). This query does use the index, but is still not very fast at all:

EXPLAIN ANALYZE
SELECT
member_id
FROM
marital_matrix m1
WHERE
status & B'10000' = B'10000';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_marital_matrix_single on marital_matrix m1
(cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
loops=1)
Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
Total runtime: 451.162 ms
(3 rows)

METHOD 4: Using an IN statement. This method seems to be very fussy about
using the index, and I have at some point made it use the index when there
are less than 3 possibilites. Also, for fields other than Marital Status,
users will be able to select many bits for their own profile, which means
there would be many permutations:

EXPLAIN ANALYZE
SELECT
member_id
FROM
marital_matrix
WHERE
status & B'11011' IN (B'10000',B'01000');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on marital_matrix (cost=0.00..2602.73 rows=993 width=4) (actual
time=17.845..288.279 rows=47525 loops=1)
Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
(((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
& B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
= B'00001'::"bit"))
Total runtime: 488.651 ms
(3 rows)

Method 3 is the only one that used the index, but the only really acceptable
method here is Method 1.

My questions are...
- Is there any hope in getting this to use an efficient index?
- Any mathmaticians know if there is a way to reorder my bitwise comparison to
have the operator use = and not an != (perhaps to force an index)? (AFAIK,
the answer to the second question is no)

If anyone could offer any performance tips here I'd really appreciate it. I
imagine that having this schema wouldn't last an hour with the amount of CPU
cycles it would be consuming on math operations.

Also, I have read the thread that was posted here by Daniel in August:
http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php

I have spoke with Daniel on this issue and we both agree it's very difficult
to find a solution that can scale to very large sites.

I would very much appreciate any advice that some experienced users may have
to offer me for such a situation. TIA

Patrick

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Christopher Kings-Lynne 2004-09-16 08:11:00 Re: Comparing user attributes with bitwise operators
Previous Message Joe Conway 2004-09-16 05:17:45 Re: Data Warehouse Reevaluation - MySQL vs Postgres --