counting text matches - any recipes?

From: "Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To: pgsql-sql(at)postgresql(dot)org
Subject: counting text matches - any recipes?
Date: 2002-11-04 22:53:20
Message-ID: 20021104225320.GA6001@wallace.ece.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hey all -
I'm working on the query interface to a system that stores a bunch of
(realtively) small text documents, with titles and keywords, etc.

For the basic interface, we have the user types some words into a textbox,
then just search 'everything'. I'm weighting the results, so that hits
on the title count more than on keywords, which count more than on the
abstract, which count more than in the body.

This is working reasonably well, but I'm stuck on one counting problem:
what's the natural way to count multiple matches in SQL?

As an example, let's use the title a.k.a. 'name' of an article (called
modules, in the schema).

I've already generating a big select with a stanza for each class of match,
unioned together, weighted and summed. Here's a typical stanza for the name:

select moduleid, name, version, created, revised, abstract,
count(*)*100 as weight
from current_modules cm, abstracts a
where
cm.abstractid = a.abstractid
and (
name ~* 'Fourier'
or name ~* 'series'
)
group by moduleid, name, version, created, revised, abstract

Obviously, this will give one hit on a module with the name 'Fourier Series',
as well as one for 'Fourier Transforms', and one for 'Time Series Analysis'.

It's probably blindingly obvious, but how would I structure this to get
_two_ hits for 'Fourier Series', that'll still scale to, say, a dozen
search terms entered? I've thought of the subselect route, as so:

select moduleid, name, version, created, revised, abstract,
count(*)*100 as weight
from (
select moduleid, abstractid, name, version, created, revised
from current_modules cm
where name ~* 'Fourier'
union all
select moduleid, abstractid, name, version, created, revised
from current_modules cm
where name ~* 'series'
) as bar, abstracts a
where
bar.abstractid = a.abstractid
group by moduleid, name, version, created, revised, abstract

But I'm not sure how well that'll scale, since this is already a subselect,
so I'd be nesting two deep.

Ross

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Josh Berkus 2002-11-04 23:26:12 Re: counting text matches - any recipes?
Previous Message Larry Rosenman 2002-11-04 21:01:10 Re: [SQL] Database Design tool