Re: [SQL] Joining bug????

From: "Gene Selkov Jr(dot)" <selkovjr(at)mcs(dot)anl(dot)gov>
To: Samir Aggarwal <aggarwal(at)dti(dot)net>
Cc: pgsql-interfaces(at)postgreSQL(dot)org
Subject: Re: [SQL] Joining bug????
Date: 1998-10-30 01:11:30
Message-ID: 199810300110.TAA04192@antares.mcs.anl.gov
Lists: pgsql-interfaces

> hi Gene,
> Would you mind sharing some code samples, so i could attempt to do
> the same thing...

OK, but first let me describe my database setup. Yours may be entirely different, to the extent that it would be difficult to justify using such a method. Basically, I use the tables in the postgres database not as primary storage but as multi-key indices to objects stored elsewhere. That is similar to how postgres handles large objects. The difference is that the objects I have to deal with are rather complex and usually have no adequate relational representation, but they can be parsed to extract and index individual attributes. Each attribute is stored in a separate table together with its location data. It is, in effect, a keyword search system.

In this setup, the task of retrieving an object that has an attribute with a particular value, or a range of values, is trivial, but it is not so if you are interested in multi-attribute constraints. That task would be better addressed by a multi-column table, but since my objects do not conform to any particular format and involve random combinations of more than 400 attributes, I can't afford a single 400-column table with millions of variable-length cells, some of them recursively structured.

So it seems the only way of retrieving the right objects is to use multiple joins on the tables representing individual attributes. I still have to find out whether I can trust the postgres optimizer to do a join between 20 tables, or even 10 -- that is the most typical constraint size I have to deal with.
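
For a concrete illustration of what such a query looks like, suppose two of the attribute tables were organism(obj_id, value) and ph(obj_id, value) -- these names are entirely made up for this example and are not my actual schema; the Pg module calls are the standard connectdb/exec interface:

use Pg;

my $conn = Pg::connectdb("dbname=attrdb");   # hypothetical database name
my $result = $conn->exec(<<'SQL');
SELECT o.obj_id
FROM   organism o, ph p
WHERE  o.obj_id = p.obj_id                   -- join two attribute tables
  AND  o.value ~* 'bacteria'                 -- constraint on one attribute
  AND  p.value >= 7.5 AND p.value <= 8       -- constraint on another
SQL
for my $i ( 0 .. $result->ntuples - 1 ) {
    print $result->getvalue($i, 0), "\n";
}

With 10 or 20 constrained attributes, the FROM and WHERE lists grow accordingly, and that is exactly the join the optimizer has to survive.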

> I am in the process of getting more in tune with perl hashes.

~~~~~~~~~~~~~~~~~~~~
1. If you have not used perl hashes before, here is the idea: a hash in perl is like a two-column table, with unique keys in the first column, from which you can extract one row at a time using an equivalent of the '=' operator. It is pretty restrictive, but the advantage of using perl hashes to post-process your SQL query is that they are incredibly fast. To place a value in the hash %h, you use the assignment $h{'key'} = 'value', or $h{$key} = $value if you want to use variables. To retrieve the value back, use an expression like $val = $h{'a'}; to test for the existence of a key, use something like this: if ($h{'a'}) { ... }

This illustrates the second advantage of hashes: easy programming (although a bit cryptic for beginners).
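
Put together, as a scrap you can run (the keys and values are made up, of course):

my %h;
$h{'a'} = 'apple';                  # store with a literal key
my ($key, $value) = ('b', 'berry');
$h{$key} = $value;                  # store using variables
my $val = $h{'a'};                  # retrieve: $val is now 'apple'
if ($h{'b'}) { print "got b\n" }    # test for the existence of a key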

Now, here is the simplest example: your client program obtained one set of keys in array @a and the second set in array @b (using Pg.pm or any other client library). The intersection of @a and @b is computed like this (a classical perl example): http://www.pun.org/bram/Class/Perl/2/Lab_Hash.html
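
In case that page is unreachable, the idiom itself fits in a few lines (the variable names here are mine):

my %seen;
$seen{$_}++ for @a;                          # mark every member of @a
my @intersection = grep { $seen{$_} } @b;    # keep members of @b seen in @a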

~~~~~~~~~~~~~~~~~~~~
2. For a real-life example of using intersection with multiple attributes, please refer to my metabolic pathways page: http://wit.mcs.anl.gov/MPW/
The CGI that is doing this job is here: http://wit.mcs.anl.gov/~selkovjr/intersect.tgz
The script was done in an extremely ad hoc fashion, with multiple repetitions in the code, but the essence is this (the rest is preparation of the queries and error handling):

For each query term, you get the result (key set) and place it into a hash, %match:

if ( $result->ntuples > 0 ) {
    for $i ( 0 .. $result->ntuples - 1 ) {
        $match{$result->getvalue($i, 0)}++;   # count one more hit for this key
    }
}
$terms++;                                     # one more query term processed
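
(For completeness, $result in the fragment above comes from the Pg module, roughly like this; the database name and the $query variable are placeholders of mine, not the actual ones:)

use Pg;

my $conn   = Pg::connectdb("dbname=mpw");   # placeholder connection string
my $result = $conn->exec($query);           # one such call per query term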

After all the individual queries are executed, each key in %match holds the number of times that key was returned by the queries. Naturally, this number must equal the number of query terms for the key to make it into the output:

$cgi->print( "<p><pre>\n");
foreach ( keys %match ) {
if ( $match{$_} == $terms ) {
$cgi->print( "<a href=\"/WIT2/CGI/pw.cgi?&pw=$_\">$_</a>\n");
}
}

This example is very simple, and although it does its job for the fixed-type query form, where all criteria are supposed to be ANDed, it is not generic enough. Here is a more involved one:

~~~~~~~~~~~~~~~~~~~~~~
3. Using boolean expressions to control set operations on the query results. Here is an example; the constraint parameter in this URL decodes to (T1 | T2) & (N1 && !T3) & E1 (I hope your mailer does not mangle the URL, given its length):

http://wit.mcs.anl.gov/EMP/select_emp_advanced.cgi?E1.ec_code=ec&E1.ec_code.op=%7E%09is+in+range&E1.ec_code.patt=1.1&ec_code.count=1&T1.text=ccc&T1.text.op=%7E*%09matches+regex.%2C+case-insensitive&T1.text.patt=bacteria&T2.text=tax&T2.text.op=%7E*%09matches+regex.%2C+case-insensitive&T2.text.patt=fungi&T3.text=phd&T3.text.op=%7E*%09matches+regex.%2C+case-insensitive&T3.text.patt=v&text.count=1&N1.seg=pho&N1.seg.op=%7E%09contained+in&N1.seg.patt=7.5+..+8&seg.count=1&constraint=%28T1+%7C+T2%29+%26+%28N1+%26%26+%21T3%29+%26+E1&do=Run+the+query

The CGI for doing this kind of query and its support modules can be obtained here: http://wit.mcs.anl.gov/~selkovjr/boolean.tgz
To build the parser, unpack the archive, cd to boolean/, then type

perl Makefile.PL; make

It will build the module BoolParser.pm, also available at http://wit.mcs.anl.gov/~selkovjr/BoolParser.pm

The upper directory (where the CGI runs) has a symlink to BoolParser.pm (to avoid a global installation), but you can substitute the real BoolParser.pm. BoolParser.pm is used by Parser.pm, which is in turn used by the cgi script; Parser.pm is a sort of object-oriented encapsulation of BoolParser.pm, and it is where the user actions for set operations are implemented. For example, this rule in BoolParser.y

| expr STRICT_AND expr {
      $$ = $SELF->actionStrictAnd($1, $3);
  }

invokes the method actionStrictAnd() in Parser.pm, which does the actual job (doesn't it look like an example from the Camel book?):

sub actionStrictAnd {
    my ($self, $ar1, $ar2) = @_;
    my %tmp;                              # fresh mark set on every call
    grep ($tmp{$_}++, @$ar2);             # mark every key of the second set
    my @result = grep ($tmp{$_}, @$ar1);  # keep first-set keys that are marked
    $self->{SIZE} = $#result;
    return \@result;
}
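
The grammar evidently supports more operators ('|', '&', '!', as in the constraint example above); I am not quoting their actions here, but modeled on actionStrictAnd they might look like this (the method names are my guesses, not necessarily what Parser.pm uses):

# Union: every key that occurs in either set, counted once.
sub actionOr {
    my ($self, $ar1, $ar2) = @_;
    my %tmp;
    my @result = grep (!$tmp{$_}++, @$ar1, @$ar2);
    $self->{SIZE} = $#result;
    return \@result;
}

# Difference: keys of the first set that are absent from the second.
sub actionAndNot {
    my ($self, $ar1, $ar2) = @_;
    my %tmp;
    grep ($tmp{$_}++, @$ar2);
    my @result = grep (!$tmp{$_}, @$ar1);
    $self->{SIZE} = $#result;
    return \@result;
}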

Oops! I almost forgot to say that you will need http://wit.mcs.anl.gov/~selkovjr/camel-1.24.tar.gz to build BoolParser.pm from BoolParser.y. (It is actually bison-1.24 disguised as "camel"; the original bison can't build parsers in perl.)

This is a very quick hack and I did not give it a lot of testing (imagine, I had two days left before the publication deadline when I realized that multiple joins were doing me no good). Also, doing such things on the client side is complete heresy as far as SQL goes, but, considering the circumstances, I think it is not an unreasonable solution. It has never let me down, and I have never had to wait more than a few seconds for even the most complex queries.

--Gene
