Re: Self-referencing table question

From: Edmund Bacon <ebacon(at)onesystem(dot)com>
To: Sean Davis <sdavis2(at)mail(dot)nih(dot)gov>
Cc: PostgreSQL SQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Self-referencing table question
Date: 2005-03-24 18:11:29
Message-ID: 424302D1.7000602@onesystem.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Sean Davis wrote:

> Thanks. I thought about that a bit and it seems like it is highly
> likely to be expensive for a single query (though I should probably
> try it at some point). If I do find myself reformatting results after
> response to user input (i.e., reusing the query), though, then your
> solution is likely to be very useful.
>
You might think so, however, consider:

$ cat temptable.sql
SELECT to_id
INTO TEMP TABLE tids
FROM correlation
WHERE from_id = 1234
ORDER BY val DESC limit 100;

SELECT t1.to_id AS from_id, t2.to_id
INTO TEMP TABLE from_to
FROM tids t1, tids t2
WHERE t1.to_id > t2.to_id;

SELECT c.from_id, c.to_id, c.val
FROM from_to
JOIN correlation c USING(from_id, to_id)
WHERE val > 0.5
order by from_id, to_id;
$ cat subselect.sql
select
from_id, to_id, val
from correlation
where from_id in (select to_id from correlation
where from_id = 1234
order by val desc limit 100)
and to_id in (select to_id from correlation
where from_id = 1234
order by val desc limit 100)
and from_id > to_id
and val > 0.5
order by from_id, to_id;
$ psql -q -c "select count(1) from correlation" test
count
----------
64000000
(1 row)

$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m3.093s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m0.238s
user 0m0.000s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m1.945s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m1.949s
user 0m0.010s
sys 0m0.010s
$ time psql -q -f subselect.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m1.953s
user 0m0.010s
sys 0m0.000s
$ time psql -q -f temptable.sql test | md5sum -
f289b259ce8a2a4a45f9e0eca6e31957 -

real 0m0.237s
user 0m0.020s
sys 0m0.000s
$

Note that the subselect version takes about 10 times as long as the
temptable version, and does not seem to be dependent on what data might
be cached.

I added the sort so that the results would be in the same order - you
can see that the two query sets are producing the same output (at least
md5sum thinks they are the same).

One thing to be aware of is the size of your returned data set - If it's
fairly large, then the transfer time from your web-server to the pgsql
box might overwhelm any "small" optimization in query time.

> Sean
>
> On Mar 24, 2005, at 11:13 AM, Edmund Bacon wrote:
>
>> Sometimes using a temp table is a better idea:
>>
>> e.g.
>>
>> -- start by creating a temp table 'tids' that hold the to_ids that
>> -- we are interested in.
>> SELECT to_id
>> INTO TEMP TABLE tids
>> FROM correlation
>> WHERE from_id = 1234
>> ORDER BY val DESC limit 100;
>>
>> -- The following temp table makes use of the primary key on
>> -- the correlation table, and the stated goal from the original
>> -- question that:
>> -- from_id > to_id
>> -- and from_id in (tids.to_id)
>> -- and to_id in (tids.to_id)
>>
>> SELECT t1.to_id AS from_id, t2.to_id
>> INTO TEMP TABLE from_to
>> FROM tids t1, tids t2
>> WHERE t1.to_id > t2.to_id;
>>
>> -- Now we can use the from_to table as an index into the correlation
>> -- table.
>>
>> SELECT c.from_id, c.to_id, c.val
>> FROM from_to
>> JOIN correlation c USING(from_id, to_id)
>> WHERE val > 0.5;
>>
>>
>> The explain analyze for the final select works out to:
>> Nested Loop (cost=0.00..50692.00 rows=8488 width=16) (actual
>> time=0.171..150.095 rows=2427 loops=1)
>> -> Seq Scan on from_to (cost=0.00..79.38 rows=5238 width=8)
>> (actual time=0.006..7.660 rows=4950 loops=1)
>> -> Index Scan using correlation_pkey on correlation c
>> (cost=0.00..9.63 rows=2 width=16) (actual time=0.024..0.025 rows=0
>> loops=4950)
>> Index Cond: (("outer".from_id = c.from_id) AND ("outer".to_id
>> = c.to_id))
>> Filter: (val > 0.5::double precision)
>> Total runtime: 152.261 ms
>>
>>
>> Richard Huxton wrote:
>>
>>> Sean Davis wrote:
>>>
>>>> I answer my own question, if only for my own records. The
>>>> following query is about 5-6 times faster than the original. Of
>>>> course, if anyone else has other ideas, I'd be happy to hear them.
>>>>
>>>> Sean
>>>>
>>>> explain analyze select from_id,to_id,val from exprsdb.correlation
>>>> where from_id in (select to_id from exprsdb.correlation where
>>>> from_id=2424 order by val desc limit 100) and to_id in (select
>>>> to_id from exprsdb.correlation where from_id=2424 order by val
>>>> desc limit 100) and val>0.6 and to_id<from_id;
>>>
>>>
>>>
>>> Might not be any faster, but you can do this as a self-join with
>>> subquery:
>>>
>>> SELECT c1.from_id, c1.to_id, c1.val
>>> FROM
>>> correlation c1,
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c2
>>> (
>>> SELECT to_id FROM correlation WHERE from_id=2424
>>> ORDER BY val DESC LIMIT 100
>>> ) AS c3
>>> WHERE
>>> c1.from_id = c2.to_id
>>> AND c1.to_id = c3.to_id
>>> AND c1.val > 0.5
>>> AND c1.to_id < from_id
>>> ;
>>>
>>> I think PG should be smart enough nowadays to figure out these two
>>> queries are basically the same.
>>
>>
>>
>> --
>> Edmund Bacon <ebacon(at)onesystem(dot)com>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 3: if posting/reading through Usenet, please send an appropriate
>> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
>> message can get through to the mailing list cleanly
>
>

--
Edmund Bacon <ebacon(at)onesystem(dot)com>

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Moran.Michael 2005-03-24 18:59:08 Re: Funtions + plpgsql + contrib/pgcrypto = ??
Previous Message Jim Buttafuoco 2005-03-24 17:58:21 Re: Funtions + plpgsql + contrib/pgcrypto = ??