Skip site navigation (1) Skip section navigation (2)

Re: Comparing user attributes with bitwise operators

From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Patrick Clery <patrick(at)phpforhire(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Comparing user attributes with bitwise operators
Date: 2004-09-16 08:11:00
Message-ID: 41494A94.5080800@familyhealth.com.au (view raw or flat)
Thread:
Lists: pgsql-performance
Sounds like you want a many-to-many table that maps user_ids to match_ids

Then you can put an index over (user_id, match_id) and the search will 
be very fast.

Chris

Patrick Clery wrote:
> I'm working on a dating/personals/match-making site, that has used many 
> different methods of "match-making", that all seem to be very slow. One I am 
> attempting now that seems to be an efficient method of storage, but not the 
> best for indexing, is using bitwise operators to compare one person's profile 
> to another's.
> 
> This method seems to be very fast on a small scale, but I am dealing with a 
> large user-base here, in excess of 200,000 users that will be executing this 
> search function every time they login (the search results of their profile 
> will appear on the main page after they have logged in). I've opted to use 
> "label tables" for each possible set of answers. (i.e: Marital Status)
> 
> For this table, the set of bits -- bit(5) -- are represented as such:
> 
> +-----+------------+
> | Bit | Meaning    |
> +-----+------------+
> | 1   | single     |
> | 2   | separated  |
> | 3   | married    |
> | 4   | divorced   |
> | 5   | widowed    |
> +-----+------------+
> 
> Here's the structure of the marital status table:
> 
> # \d marital_matrix                                                                                                                 
> Table "public.marital_matrix"
>   Column   |      Type      |                               Modifiers                               
> -----------+----------------+-----------------------------------------------------------------------
>  member_id | integer        | not null default 
> nextval('public.marital_matrix_member_id_seq'::text)
>  status    | bit varying(5) | not null default (0)::bit(5)
>  p_status  | bit varying(5) | not null default (0)::bit(5)
> Indexes:
>     "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
>     "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
>     "idx_marital_p_status" btree (p_status)
>     "idx_marital_status" btree (status)
> Foreign-key constraints:
>     "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE 
> CASCADE DEFERRABLE INITIALLY DEFERRED
> 
> To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a 
> smaller sample than what I will actually be using):
> 
> datingsite=> select count(*),status,p_status from marital_matrix group by 
> status,p_status;
>  count | status | p_status 
> -------+--------+----------
>     89 | 00001  | 00000
>   1319 | 00010  | 00000
>   2465 | 00100  | 00000
>      1 | 00100  | 11111
>  46117 | 10000  | 00000
> 
> here is the user I'll be comparing against, which has selected that he be 
> matched with any but married people:
> 
> datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
>  member_id | status | p_status 
> -----------+--------+----------
>         21 | 10000  | 11011
> (1 row)
> 
> 
> 
> 
> Here are a few possible comparison methods I can think of (NOTE: tests were 
> run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:
> 
> 
> METHOD 1: Any bit that is set in p_status (prefered marital status) of the 
> searching user should be set in the potential match's marital status. This is 
> the method I'd like to improve, if possible. Running the query twice didn't 
> produce a different runtime.
> 
> EXPLAIN ANALYZE
> SELECT
>     m2.member_id
> FROM
>     marital_matrix m1, marital_matrix m2
> WHERE
>     m1.member_id = 21 AND
>     m2.status & m1.p_status != B'00000';
>                                                                 QUERY PLAN                                                                  
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..2357.79 rows=49742 width=4) (actual 
> time=18.062..708.469 rows=47525 loops=1)
>    Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <> 
> B'00000'::"bit")
>    ->  Index Scan using marital_matrix_pkey on marital_matrix m1  
> (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
>          Index Cond: (member_id = 21)
>    ->  Seq Scan on marital_matrix m2  (cost=0.00..1602.91 rows=49991 width=13) 
> (actual time=17.966..255.529 rows=49991 loops=1)
>  Total runtime: 905.694 ms
> (6 rows)
> 
> 
> METHOD 2: Specifying the value (I don't think this would make a difference, 
> but I'll post anyways):
> 
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' != B'00000';
> 
>                                                       QUERY PLAN                                                      
> ----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..1852.87 rows=49742 width=4) (actual 
> time=18.113..281.101 rows=47525 loops=1)
>    Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
>  Total runtime: 480.836 ms
> (3 rows)
> 
> 
> METHOD 3: Checking for one bit only. This is definitely not a "real world" 
> example and unacceptable since the p_status column can and will have multiple 
> bits. For categories other than "Marital Status", such as "Prefered Hair 
> Color", the users are likely to select multiple bits (they choose all that 
> apply). This query does use the index, but is still not very fast at all:
> 
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix m1
> WHERE
>     status & B'10000' = B'10000';
>                                                                       QUERY 
> PLAN                                                                       
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>  Index Scan using idx_marital_matrix_single on marital_matrix m1  
> (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117 
> loops=1)
>    Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
>  Total runtime: 451.162 ms
> (3 rows)
> 
> METHOD 4: Using an IN statement. This method seems to be very fussy about 
> using the index, and I have at some point made it use the index when there 
> are less than 3 possibilites. Also, for fields other than Marital Status, 
> users will be able to select many bits for their own profile, which means 
> there would be many permutations:
> 
> EXPLAIN ANALYZE
> SELECT
>     member_id
> FROM
>     marital_matrix
> WHERE
>     status & B'11011' IN (B'10000',B'01000');
>           QUERY PLAN                                                                                                                      
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on marital_matrix  (cost=0.00..2602.73 rows=993 width=4) (actual 
> time=17.845..288.279 rows=47525 loops=1)
>    Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR 
> (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit" 
> & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit") 
> = B'00001'::"bit"))
>  Total runtime: 488.651 ms
> (3 rows)
> 
> 
> Method 3 is the only one that used the index, but the only really acceptable 
> method here is Method 1.
> 
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Any mathmaticians know if there is a way to reorder my bitwise comparison to 
> have the operator use = and not an != (perhaps to force an index)? (AFAIK, 
> the answer to the second question is no)
> 
> If anyone could offer any performance tips here I'd really appreciate it. I 
> imagine that having this schema wouldn't last an hour with the amount of CPU 
> cycles it would be consuming on math operations.
> 
> Also, I have read the thread that was posted here by Daniel in August:
> http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php
> 
> I have spoke with Daniel on this issue and we both agree it's very difficult 
> to find a solution that can scale to very large sites.
> 
> I would very much appreciate any advice that some experienced users may have 
> to offer me for such a situation. TIA
> 
> Patrick
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

In response to

Responses

pgsql-performance by date

Next:From: Greg StarkDate: 2004-09-16 08:44:29
Subject: Re: Comparing user attributes with bitwise operators
Previous:From: Patrick CleryDate: 2004-09-16 07:41:37
Subject: Comparing user attributes with bitwise operators

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group