Re: Comparing two slices within one table efficiently

From: Andrew Kroeger <andrew(at)sprocks(dot)gotdns(dot)com>
To: ksimpson(at)mailchannels(dot)com
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Comparing two slices within one table efficiently
Date: 2007-08-13 19:25:16
Message-ID: 46C0B01C.3020307@sprocks.gotdns.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Ken Simpson wrote:
> I have a table with the following simplified form:
>
> create table t (
> run_id integer,
> domain_id integer,
> mta_id integer,
> attribute1 integer,
> attribute2 integer,
> unique(run_id, domain_id, mta_id)
> );
>
> The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on.
>
> I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is there a coresponding tuple in run 2 where either attribute1 or attribute2 have changed?
>
> The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long time. e.g.
>
> select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1 != t1.attribute1 or t2.attribute2 != t1.attribute2)
>
> This query takes millenia...
>
> Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort of operation must be common in databases.

A self-join will probably perform better in your case. Consider the
following query:

select
t1.domain_id as domain_id,
t1.mta_id as mta_id,
t1.run_id as run_id_1,
t1.attribute1 as attribute1_1,
t1.attribute2 as attribute2_1,
t2.run_id as run_id_2,
t2.attribute1 as attribute1_2,
t2.attribute2 as attribute2_2
from
t t1
join t t2 on
t1.mta_id = t2.mta_id and
t1.domain_id = t2.domain_id and
(t1.attribute1 != t2.attribute1 or t1.attribute2 != t2.attribute2)

You are basically joining together 2 copies of the same table (one
aliased to "t1" and the other aliased to "t2"). The logic in the join
conditions are the same you had in the subselect.

You will probably want to constrain the query some more in the join
conditions. As my example above is currently written, you will get 2
records for each true difference (e.g. for a given mta_id/domain_id,
you'll get 1 record when run_id X comes from t1 and run_id Y comes from
t2 & you get another record describing the same difference when run_id X
comes from t2 and run_id Y comes from t1).

Some example constraints that might help are:

t1.run_id < t2.run_id

Assuming run_id increases chronologically, this will return 1 record for
each case there is a difference in attributes from any prior run.

t1.run_id + 1 = t2.run_id

Again assuming run_id increases chronologically, this will only return a
record when there are attribute differences between 2 successive runs.

Indexes should also be considered for the performance of this query. At
a minimum, you will probably want a compound index on (domain_id,
mta_id). [... thinks for a bit ...] Actually, you could re-order your
unique constraint to take care of this, as a unique constraint is
basically an index with a unique constraint on it. Your unique
constraint is currently on (run_id, domain_id, mta_id). If you re-order
that unique constraint to (domain_id, mta_id, run_id), you will probably
see a couple of benefits:

- Due to how PG can use the first parts of a compound index, it will be
able to use the (domain_id, mta_id) portion as an index to help your joins.

- I assume domain_id is much more selective than run_id, so the unique
checks should speed up as well. The way you have the unique constraint
written, the index is first scanned based on run_id which isn't very
selective - once it finds the correct run_id in the index, there will
still be a lot of index entries to scan to match on domain_id and
mta_id. Re-ordering the way I propose should narrow down the number of
index entries quite quickly, as it will first be narrowed on domain_id,
then mta_id, and finally run_id.

Hope this helps.

Andrew

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Christian Kindler 2007-08-13 19:29:52 how to moce back in refcursor
Previous Message chester c young 2007-08-13 18:45:41 Re: Comparing two slices within one table efficiently