Re: Removing redundant itemsets

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Allan Kamau <allank(at)sanbi(dot)ac(dot)za>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Removing redundant itemsets
Date: 2008-03-31 11:29:22
Message-ID: 47F0CB12.5010506@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Craig Ringer wrote:
> Allan Kamau wrote:
>> Hi all,
>> I have a list of purchases (market basket) and I would like to select
>> non redundant longest possible patterns by eliminating
>> (creating/populating other table to contain only non redandant itemsets)
>> purchases having item lists which are fully included in at least one
>> other purchase.
>
> Here's a possibly slow and surely ugly solution (I think it's right,
> though I haven't done more than passing testing):
>
>
>
> CREATE VIEW togo_as_arr AS
> SELECT a.tid,
> ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item)
> AS items
> FROM togo a GROUP BY tid;
>
> SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by
> FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b
> WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items;

Alternately:

-- Helps with the massively repeated subquery below
CREATE INDEX togo_by_tid_and_item ON togo(tid,item);

-- Find any `a' for which `item_from_a_is_in_b' is
-- true for all items in `a'
SELECT a_tid AS is_redundant, b_tid AS contained_by
FROM (
-- For every item in every pair of purchases,
-- determine whether the item in purchase `a'
-- was also in purchase `b'.
SELECT
a.tid AS a_tid,
b.tid AS b_tid,
a.item AS item,
EXISTS(
-- Was this item from `a' also in the `b' purchase?
SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item
) AS item_from_a_is_in_b
FROM togo a INNER JOIN togo b ON (a.tid <> b.tid)
GROUP BY a.tid, b.tid, a.item) AS item_containment
GROUP BY a_tid, b_tid
HAVING every(item_from_a_is_in_b);

... which avoids the array building, but is actually considerably slower
on the trivial test data. That's not too surprising given that this
approach requires a subquery:

SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item

for EVERY item to be tested. Twice, actually, as each record appears in
both `a' and `b' positions.

I'd be very interested to see what happened on real world test data,
especially compared to doing the array accumulation based query off a
temporary table instead of a view.

I suspect it'll depend on the average number of items per purchase -
lots of items per purchase and the array building cost will dominate.
That's really just a guess, though.

I'm sure there's a properly smart way to do this that I just can't
figure out, but this is the best I've come up with so far.

--
Craig Ringer

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Craig Ringer 2008-03-31 11:48:34 Re: Removing redundant itemsets
Previous Message Craig Ringer 2008-03-31 10:53:28 Re: Removing redundant itemsets