Re: [NOVICE] Perl - Postgres

From: "codeWarrior" <GPatnude(at)adelphia(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [NOVICE] Perl - Postgres
Date: 2003-01-31 15:22:59
Message-ID: b1e3ub$j5e$1@news.hub.org
Lists: pgsql-general pgsql-novice

DOS / WINDOWS:

V:\perl\bin> perldoc pg.pm > pgdoc.txt

-- Then read the text file...

Unix:

/usr/bin/perldoc Pg > pgdoc.txt

Then there's always the Pg module documentation on the CPAN web site (I know
it looks bad, but this link works...):

http://ls6-www.cs.uni-dortmund.de/cgi-bin/SFgate?language=English&verbose=1&listenv=DL&application=cpan&convert=CPAN&converthl=&refinequery=application%3dcpan%26convert%3dCPAN%26database%3dwait%252FDB%252Fcpan%26detex%3d1%26formextern%3dCPAN%26multiple%3d1%26range%3d1%26verbose%3d1%26name%3dPg.pm%26synopsis%3d%26text%3d%26author%3d%26tie%3dor%26_lines%3dwhole%2bdocument%26maxhits%3d40%26language%3dEnglish&formintern=&formextern=CPAN&transquery=name%3dpg.pm&_lines=&multiple=0&descriptor=wait%2fDB%2fcpan%7c1000%7c18068%7cDBD::Pg%20-%20PostgreSQL%20database%20driver%20for%20the%20DBI%20module%20%7cCPAN%7cwait%7c%2fservices%2fwww-db%2fWAIT%2fDB%2fcpan%7c301

OR http://www.cpan.org
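To answer the original question directly: with the DBI module and its DBD::Pg
driver (the module the CPAN link above describes), inserting a row looks
roughly like this. The database name, table, and columns here are made up for
illustration; substitute your own:

```perl
#!/usr/bin/perl -w
use strict;
use DBI;

# Connect to Postgres through DBD::Pg -- dbname/user/password are
# placeholders for your own setup.
my $dbh = DBI->connect("dbi:Pg:dbname=test", "user", "password",
                       { RaiseError => 1, AutoCommit => 1 });

# Insert one row; the ? placeholders let DBI quote the values for you.
my $sth = $dbh->prepare("insert into people (name, age) values (?, ?)");
$sth->execute("Rosta", 30);

$dbh->disconnect();
```

`perldoc DBI` and `perldoc DBD::Pg` document the rest (selects, fetching
rows, transactions).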

"Richard A Lough" <ralough(dot)ced(at)dnet(dot)co(dot)uk> wrote in message
news:3E381F1A(dot)2FC108E9(at)dnet(dot)co(dot)uk(dot)(dot)(dot)
> This is a multi-part message in MIME format.
> --------------C73B285B6B29F4468FCAF0DE
> Content-Type: text/plain; charset=us-ascii
> Content-Transfer-Encoding: 7bit
>
> Rosta Farzan wrote:
> >
> > Hi
> >
> > Where can I find the list of the functions of pg module, for example how
> > to insert a row to the database?
> > Does anybody know a tutorial on perl & Postgres?
> >
> > Thanks
> > Rosta
> >
> I have seen some examples. I believe they came with the Pg module.
>
> I attach some GPL code which was intended for Mysql. My hack is
> probably a good example of how _not_ to interface to Postgres,
> but you will need something to start with. The original article
> should be still on the DrDobbs site somewhere. (try Aug 2002)
>
> HTH
>
> Richard A Lough
> --------------C73B285B6B29F4468FCAF0DE
> Content-Type: application/x-perl;
> name="spider.pl"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: inline;
> filename="spider.pl"
>
> #!/usr/bin/perl -w
> #*-------------------------------------------------------------------
> #*- spider.pl:
> #*- Author: Manu Konchady
> #*- License: GPL
> #*- Input: A query
> #*- Output: A list of the hubs and authorities for the query
> #*- Module Dependencies: HTML::Parser 3.0, LWP::UserAgent,
> #*- URI::URL, HTTP::Request, DBI and Posix.
> #*-
> #*- Description: We assume that two tables have been created for this
> #*- particular query to store URLs and associated links
> #*- prior to the execution of the spider. An entry has also
> #*- been made in a queries table for this query.
> #*-
> #*- An initial collection of URLs are stored in the URL table
> #*- for the query followed by a parallel execution of spiders
> #*- to selectively generate the web tree for the query.
> #*- After all spiders have terminated, the hubs and authorities
> #*- for the query are listed. The list of all URLs sorted by
> #*- relevancy can also be viewed.
> #*-
> #*- Notes: This code is a sample and should be modified as
> #*- needed. For example, relevancy can be computed using several
> #*- options as described in the article. Links can be pruned or
> #*- followed depending on user criteria. Priorities can be
> #*- set for certain domains. Entity tables can be used to track
> #*- entities across web pages.
> #*-
> #*- RAL hacked to do postgres instead of MySQL
> #*- tables are utab, queries, ltab
> #*- ( url varchar(120), status varchar(10), site varchar(120), relevancy float4, level int4, spider int4)
> #*- ( query varchar(250), qword varchar(250), url_limit int4, url_count int4, status varchar(10))
> #*- the "query" is the text to match; the qword is a reference name
> #*- that was used to set up the tables
> #*- ( url varchar(250), link varchar(250))
> #*-------------------------------------------------------------------
>
> use POSIX ":sys_wait_h";
> use PgDBI;
> use MyWeb;
> use strict;
>
> my $dbname = 'querynet1';
> my ($i, @pid, $done, $num_spiders, $kid, $query, %urls, $count);
> my ($qword, $utab, $ltab, $dbh, $sth, $command, $url_limit, $url);
>
> #*-- The first argument is a table prefix associated with the query.
> #*-- It is used to assign the url and link tables for the query.
> #*-- The database name for the db connection must be entered.
> $qword = $ARGV[0];
> $utab = "$qword" . "_u"; $ltab = "$qword" . "_l";
> $dbh = PgDBI::connect_db("dbname=$dbname");
>
> #*-- get the full query and maximum number of urls for the query
> $command = "select query, url_limit from queries where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($query, $url_limit) = PgDBI::fetch_row($sth);
>
> #*-- get an initial set of urls from Google for the query
> &call_google();
> #*-- close the database in case Google has crashed it
>
> PgDBI::disconnect_db($dbh, $sth);
>
> #*-- start the parallel operation with 4 spiders, this
> #*-- works on Linux. On a Windows platform, you may need
> #*-- to use the Win32 module to start individual processes
> #$num_spiders = 1;
> $num_spiders = 4;
> for $i (1..$num_spiders)
> {
> if ($pid[$i] = fork()) #*-- the parent code continues
> { next; }
> else #*-- the child code continues
> { defined($pid[$i]) or die "fork: $!"; &fetch_text($i); }
> }
> ####################################################################
> #*-- wait till all the spiders are complete. Every two seconds,
> #*-- check the status of the spiders
> $done = 0;
> while (!($done))
> {
> for $i (1..$num_spiders)
> {
> next if ($pid[$i] == 0);
> $kid = waitpid($pid[$i], &WNOHANG);
> $pid[$i] = 0 if ($kid == -1);
> }
> $done = 1;
> for $i (1..$num_spiders) { $done = 0 if ($pid[$i] != 0); }
> sleep(2);
> }
>
> #*-- clean up after the spiders, find all URLs which were
> #*-- not processed and delete them from the link table
>
> $dbh = PgDBI::connect_db("dbname=$dbname");
> $command = "select url from $utab where status != 'd'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($url) = PgDBI::fetch_row($sth) ) { $urls{$url}++; }
> foreach $url (keys %urls)
> { $command = "delete from $ltab where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> }
>
> #*-- delete the unprocessed urls from the url table
> $command = "delete from $utab where status != 'd'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> $command = "update queries set status = 'done' where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> #*-- print the top 10 hubs and authorities
> print ("List of top 10 Hubs\n"); $i = 1;
> $command = "select count(*), url from $ltab group by url order by 1 desc";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($count, $url) = PgDBI::fetch_row($sth) )
> { print ("$i. $url\n"); $i++; last if ($i == 11); }
>
> print ("List of top 10 Authorities\n"); $i = 1;
> $command = "select count(*), link from $ltab group by link order by 1 desc";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($count, $url) = PgDBI::fetch_row($sth) )
> { print ("$i". "- "." $url\n"); $i++; last if ($i == 11); }
>
> PgDBI::disconnect_db($dbh, $sth);
>
> exit(0);
>
>
> #*------------------------------------------------------------
> #*- Each of the spiders run the following code and then terminate.
> #*- An unprocessed URL is selected and traversed. New links
> #*- are added to the tables. The spider number is passed
> #*- as a parameter
> #*------------------------------------------------------------
> sub fetch_text()
> {
>
> my ($snum) = @_;
> my ($dbh, $sth, $done, $command, $count, $links, $ref_text, $usite);
> my ($ref_links, $url, $level, $link, %links, $site, $url_temp);
>
> #*-- create a handle for the database
> $dbh = PgDBI::connect_db("dbname=$dbname");
> $done = 0;
> while (!($done))
> {
> $link = '';
> ## $command = "lock tables $utab write, queries write"; # no good in postgres
> # $command = "begin work;"; # ral new
> # $sth = PgDBI::execute_stmt($dbh, $command);
> # $command = "lock table $utab"; # ral new
> # $sth = PgDBI::execute_stmt($dbh, $command);
> # $command = "lock table queries"; # ral new
> # $sth = PgDBI::execute_stmt($dbh, $command);
>
> #*-- check if there are any unprocessed URLs
> $command = "select url, level, site from $utab where status = ''"; # Pg
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($url, $level, $usite) = PgDBI::fetch_row($sth);
> $command = "select count(*) from $utab where status = ''"; # Pg
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($count) = PgDBI::fetch_row($sth);
>
> if ($count == 0)
> { $done = 1; next; } # Pg
>
> #*-- update the number of URLs processed in the queries table
> $command = "select count(*) from $utab where status = 'd'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($count) = PgDBI::fetch_row($sth);
> $command = "update queries set url_count = $count where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> #*-- check if the number of processed URLs exceeds the limit
> if ($count > $url_limit)
> # { $done = 1; $sth = PgDBI::execute_stmt($dbh, "unlock tables"); next; } # no good in postgres
> { $done = 1;
> # $sth = PgDBI::execute_stmt($dbh, "commit work"); #
> next; } # Pg
>
> #*-- set the status of the URL and unlock the tables
> $url_temp = $url;
> $url =~ s/'/\\'/g; $url =~ s/\$/\\\$/;
> $command = "update $utab set status = 'v' where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> #*-- if this is an irrelevant URL, then remove it
> $url = $url_temp;
> ($ref_text, $ref_links) = MyWeb::parse_URL($url);
> if (!(&match($$ref_text, $query) ) )
> {
> $command = "delete from $utab where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> if ($link eq '') { next; } else {
> $command = "delete from $ltab where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> next;}
> }
>
> #*-- add all links for this relevant URL to $ltab
> $level++; %links = %$ref_links;
> foreach $link (keys %links)
> {
>
> ($site = $link) =~ s#/.*$##i;
> #*-- skip links to ads and other formatted documents
> next if ($link =~ /doubleclick/i);
> next if ($link =~ /affiliates\./i);
> next if ($link =~ /\.(jpg|gif|pdf|ps|doc|xls|ppt|mpg|mpeg)$/i);
> #*-- skip links to the same site, except when one of the
> #*-- query keywords matches the link
> next if ( ($site eq $usite) && (!(&match_link($link, $query))) );
> $link =~ s/'/\\'/g; $link =~ s/\$/\\\$/;
> $link = lc($link);
> #*-- check for duplicates
> $command = "select count(*) from $utab where url = '$link'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($count) = PgDBI::fetch_row($sth);
> next if ($count != 0);
> #*-- check for too many URLs from the same site
> $command = "select count(*) from $utab where site = '$site'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($count) = PgDBI::fetch_row($sth);
> next if ($count > 200);
> #*-- create entries in the link and url tables
> $command = "insert into $ltab values ('$url', '$link')";
> $sth = PgDBI::execute_stmt($dbh, $command);
> $command = "insert into $utab values ('$link', '','$site', 0.0, $level, $snum)";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> }
>
> #*-- set the status and relevancy of the URL, text from
> #*-- the web page can be used to set relevancy values
> $command = "update $utab set status = 'd' where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> $command = "update $utab set relevancy = 0.2 where url = '$url'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> }
> PgDBI::disconnect_db($dbh, $sth);
> sleep (3);
> exit(0);
>
> }
>
> #*------------------------------------------------------
> #*- check if query satisfies the boolean query.
> #*------------------------------------------------------
> sub match()
> {
> my ( $text, $query_terms) = @_;
> my ($word, @words, $quotes, $retval);
>
> $quotes = '"\'';
> #*-- split a query with the AND operator and recursively
> #*-- parse the query
> if ($query_terms =~ /\bAND\b/i)
> {
> @words = split(/\bAND\b/i, $query_terms);
> foreach $word (@words)
> {
> if ($word =~ /\((.*?)\)/)
> { $retval = &match($text, $1);
> if ($retval == 0) { return(0); } else { next; }
> }
> $word =~ s/[$quotes]//g; $word =~ s/^\s+//; $word =~ s/\s+$//;
> return(0) if ($text !~ /\b$word\b/i);
> }
> return(1);
> }
>
> #*-- split a query with the OR operator and recursively
> #*-- parse the query
> if ($query_terms =~ /\bOR\b/i)
> {
> @words = split(/\bOR\b/i, $query_terms);
> foreach $word (@words)
> {
> if ($word =~ /\((.*?)\)/)
> { $retval = &match($text, $1);
> if ($retval == 1) { return(1); } else { next; }
> }
> $word =~ s/[$quotes]//g; $word =~ s/^\s+//; $word =~ s/\s+$//;
> return(1) if ($text =~ /\b$word\b/i);
> }
> return(0);
> }
>
> #*-- assume an AND operator for a query with multiple terms
> $query_terms =~ s/[$quotes]//g; $query_terms =~ s/^\s+//; $query_terms =~ s/\s+$//;
> $query_terms =~ s/([()])/\\$1/g;
> @words = split(/\s+/, $query_terms);
> foreach $word (@words)
> { return(0) if ($text !~ /\b$word\b/i); }
> return(1);
>
> }
>
>
> #*------------------------------------------------------
> #*- check if any words in the query match the query text
> #*------------------------------------------------------
> sub match_link()
> {
> my ($link, $query) = @_;
>
> my (@words, $word, $match);
>
> $match = 0;
> @words = split(/\s+/, $query);
> foreach $word (@words)
> {
> next if ( ($word =~ /\bAND\b/i) || ($word =~ /\bOR\b/i) || ($word =~ /\bNOT\b/i) );
> $match = 1 if ($link =~ /$word/i);
> }
>
> return($match);
>
> }
>
>
> #*-------------------------------------------
> #*- load the initial set of URLs from Google
> #*-------------------------------------------
> sub call_google()
> {
>
> my ($sql, $query, $url_string, $reg, $j, $script, $ref_html);
> my ($output, $site, %urls, $url);
> my ($count) = 0;
>
> $sth = PgDBI::execute_stmt($dbh, "delete from $utab");
> $sth = PgDBI::execute_stmt($dbh, "delete from $ltab");
>
> #*-- get the full query
> $sql = "select query from queries where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, "$sql");
> ($query) = PgDBI::fetch_row($sth);
>
> #*-- get urls from Google
> $query =~ s/(\W)/sprintf("%%%x", ord($1))/eg;
> $url_string = "http://www.google.com/search?num=100&safe=off&btnG=Google+Search";
> $url_string .= "&q=$query&filter=0&start=";
>
> $reg = '<p><a href=http://(.*?)>';
> undef $/;
> for ($j = 0; $j < 46; $j += 50)
> {
> $script = "$url_string$j";
> ($ref_html) = MyWeb::get_URL($script);
> $output = $$ref_html; #*-- get all the data
> while ($output =~ /$reg/sgi)
> { ($url = $1) =~ tr/A-Z/a-z/; $urls{$url}++; }
> sleep(3);
> }
>
> #*-- put the urls in a table
> foreach $url (keys %urls)
> {
> $count++;
> ($site = $url) =~ s#/.*$##i; $url =~ s/'/\\'/g; $url =~ s/\$/\\\$/;
> $command = "insert into $utab values ('$url', '', '$site', 0.0, 0, 0)";
> $sth = PgDBI::execute_stmt($dbh, $command);
> }
>
> }
>
> --------------C73B285B6B29F4468FCAF0DE
> Content-Type: text/plain
> Content-Disposition: inline
> Content-Transfer-Encoding: 8bit
> MIME-Version: 1.0
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>
> --------------C73B285B6B29F4468FCAF0DE--
>
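P.S. One caveat about the attached script: it builds SQL by interpolating
strings and escapes quotes by hand (e.g. $url =~ s/'/\\'/g). If the PgDBI
wrapper exposes the underlying DBI handle, bound placeholders are a safer
pattern, since the driver does the quoting. A sketch, reusing $dbh, $utab,
and $url from the script:

```perl
# No manual quote-escaping needed: the values are bound, not interpolated.
my $sth = $dbh->prepare("update $utab set status = ? where url = ?");
$sth->execute('v', $url);
```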
