Re: count of occurences PLUS optimisation

From: "Thurstan R(dot) McDougle" <trmcdougle(at)my-deja(dot)com>
To: pgsql-hackers(at)postgresql(dot)org(dot)pgsql-general(at)postgresql(dot)org
Subject: Re: count of occurences PLUS optimisation
Date: 2001-09-13 12:03:54
Message-ID: 3BA0A0AA.5E8D42C4@my-deja.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

HACKERS: see the end of this message about a possible optimisation for
ORDER BY+LIMIT cases (the normal use of LIMIT?)

Adam wrote:
>
> I help run a job database and have a table of search records. I want
> a query that will return the top 10 jobs by search frequency. I'm
> familiar with ORDER BY and LIMIT, so I basically need this:
>
> Given a table search_records:
> job_num
> -------
> 1
> 2
> 2
> 3
> 4
> 4
> 4
>
> I want a query that will return:
> job_num | count
> --------+------
> 1 |1
> 2 |2
> 3 |1
> 4 |3
>
> I tried
>
> select distinct job_num, (select count(*) from search_records j where
> j.job_num=k.job_num) from search_records k
>
> but it is horribly slow (it takes several minutes on a table of about
> 25k rows!). I assume it scans the entire table for every job_num in
> order to count the number of occurences of that job_num, taking order
> n^2 time. Since I can easily use job_num as an index (being integers
> from 0 to roughly 400 so far) I could just do a "select * from
> search_records" and do the counting in PHP (our HTML pre-processor) in
> order n time. However, I don't know how to do an order n*log(n) sort
> in PHP, just n^2, so there would still be an efficiency problem.
> I have Postgresql 7.0.3.
> Help is of course greatly appreciated.

I have not tried it but how about:-

select job_num from
(select job_num, count(*) as c from search_records group by job_num)
order by c limit 10;

I am not sure if count(*) would work in this context, if not try count()
on some field that is in every record.

If you can be sure that the top 10 will have at least a certain
threshold of searches (perhaps >1!) then it MIGHT be faster, due to less
data being sorted for the outer selects order by, (experiment) to do:-

select job_num from
(select job_num, count(*) as c from search_records group by job_num
HAVING c>1)
order by c limit 10;

It would depend on how efficient the ORDER BY and LIMIT work together.
(The ORDER BY could build a list of LIMIT n items and just replace items
in that list...a lot more efficient both of memory and comparisons than
building the full list and then keeping the top n)

HACKERS: If it does not do this it might be a usefull optimisation.
There would probably need to be a cutoff limit on whether to apply this
method or sort and keep n. Also for LIMIT plus OFFSET it would need to
build a list of the the total of the LIMIT and OFFSET figures.

--
This is the identity that I use for NewsGroups. Email to
this will just sit there. If you wish to email me replace
the domain with knightpiesold . co . uk (no spaces).

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Fernando Schapachnik 2001-09-13 12:09:28 Re: where cannot use alias name of column?
Previous Message Thurstan R. McDougle 2001-09-13 11:32:20 Re: Inheritance c.cities ??

Browse pgsql-hackers by date

  From Date Subject
Next Message Martijn van Oosterhout 2001-09-13 14:02:31 Re: count of occurences PLUS optimisation
Previous Message Haller Christoph 2001-09-13 11:17:46 ERROR: Cannot insert a duplicate key into a unique index