Re: Complicated query... is there a simpler way?

From: Ryan Mahoney <ryan(at)paymentalliance(dot)net>
To: Joshua Adam Ginsberg <rainman(at)owlnet(dot)rice(dot)edu>, pgsql-general(at)postgresql(dot)org
Subject: Re: Complicated query... is there a simpler way?
Date: 2001-06-27 01:24:24
Message-ID: 5.0.2.1.0.20010627020529.02b20d10@paymentalliance.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

*Disclaimer* - me sleepy, no guarantee of intelligebility or relevence ;)

First thing that strikes me is that you should use varchar datatype for
almost all of your fields with the exception of the body within posts.

If I were working on this project, I would make a table called keywords
where I would store all of the unique words (minus stop words) that are
stored in either the body, author, and first name last name. Then I'd
create a mapping table to map keyword -> body, keyword -> subject, and
keyword -> author.

break the user's search criteria into an array of words. Get rid of the
stop words. For each word, lookup the keyword id. For each matching
keyword_id find the body, subject, and authors that were hit (from mapping
table). Count your hits. Sort your result set by the author, then
subject, then body. Have the message with the documents (general) with the
most keyword hits come up above those documents with fewer.

This type of a search could serve you well depending on the diversity of
your subject matter, specificity of queries, and size of the document
corpus. You can get really complex if you like (this search doesn't take
into account idioms or synonyms or soundex....) - but I have used a similar
search method with much success and little time investment.

If you are interested in this approach, I can send you some PHP files (for
indexing documents and performing searches) and a stored procedure that
performs this type of search. The biggest downside of this method is that
maintaining the keyword and relationships can become a serious management
nightmare, especially if users can revise their posted messages.

Good Luck!

-r

At 12:49 AM 6/27/01 -0500, Joshua Adam Ginsberg wrote:

>I've got a nasty looking search query, and I'm afraid with a large table
>it's going to be horribly inefficient, and I was wondering if anybody
>could think of a way to slim this one down.
>
>I'll simplify my situation as much as I can.
>
>I'm working with two tables here. The first is a user table:
>
>create table users(
>userid integer not null primary key,
>firstnames text not null,
>lastname text not null
>);
>
>The second is a message board table:
>
>create table posts(
>postid integer not null primary key,
>reply_to integer references posts(postid),
>author integer not null references users(userid),
>subject text not null,
>body text not null
>);
>
>I think the only columns which need explaining are postid and reply_to.
>When a new thread is created, a row is inserted into posts with a unique
>postid and a null reply_to. Any subsequent responses to this original post
>are also given unique postid's but their reply_to field references the
>start of the thread.
>
>I'm writing a search function. It would be useful to search by thread, and
>not by individual post. So I've created an aggregate function
>concat(text). I've also created a scoring function that takes in three
>text variables. The first should be the query (which I'll represent as
>$query, since it's provided by the webserver). The second should be the
>subject to search (subject is scored differently than the body) and the
>third should be the body to search.
>
>Here's the query:
>
>select lastname, firstnames, subject, threadid from posts, users, (select
>threadid, concat(body) as thread_body from (select postid as threadid,
>body from posts where reply_to is null union select reply_to as threadid,
>body from posts where reply_to is not null) as inner_subquery group by
>threadid) as outer_subquery where users.userid = posts.author and threadid
>= postid and score_search('$query',subject,thread_body) > 0 order by
>score_search('$query',subject,thread_body);
>
>Forgive me if I have any syntactic errors. I'm translating from my real
>query. My real query's a little larger since my actualy scenario's a
>little more complicated. But that's four actual select calls to produce
>this. Is there a prettier way anybody can think of?
>
>Thanks!
>
>-jag

Attachment Content-Type Size
unknown_filename text/plain 166 bytes

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thomas T. Veldhouse 2001-06-27 01:27:03 Bug in createlang?
Previous Message Martijn van Oosterhout 2001-06-27 01:19:55 Re: Why is NULL not indexable?