How to use full-text search URL parser to filter query results by domain name?

From: Jess Wren <jess(dot)wren(at)interference(dot)cc>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: How to use full-text search URL parser to filter query results by domain name?
Date: 2019-04-07 04:06:26
Message-ID: 2a782f3e-7473-43ce-aad5-1de31cf2b31a@interference.cc

I am trying to understand how to use the full-text search parser for
URLs and hostnames to filter results from a text field containing URLs
based on domain, and also how to index text columns for fast
lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded
by a web crawler, with the following tables:

        pages
        ----------
        id:          Integer (primary key)
        url:         String  (unique)
        title:       String
        text:        String
        html:        String
        last_visit:  DateTime
        word_pos:    TSVECTOR

        links
        ----------
        id:         Integer (primary key)
        source:     String
        target:     String
        link_text:  String
        UNIQUE(source, target)

        crawls
        ---------
        id:         Integer (primary key)
        query:      String

        crawl_results
        -------------
        id:       Integer (primary key)
        score:    Float   (constraint 0 <= score <= 1)
        crawl_id: Integer (foreign key, crawls.id)
        page_id:  Integer (foreign key, pages.id)

The `source` and `target` fields in the `links` table contain URLs. I am
running the following query to extract scored links from the top-ranking
search results, for pages that haven't been fetched yet:

        WITH top_results AS (
            SELECT page_id, score FROM crawl_results
            WHERE crawl_id = $1
            ORDER BY score LIMIT 100
        )
        SELECT top_results.score, l.target
        FROM top_results
            JOIN pages p ON top_results.page_id = p.id
            JOIN links l ON p.url = l.source
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target = pp.url)

However, *I would like to filter these results so that only one row is
returned per domain (the one with the lowest score)*. So for instance,
if I get (0.3, 'http://www.foo.com/bar') and (0.8,
'http://www.foo.com/zor'), I only want the first, because both have the
same domain `foo.com` and the first has the lower score.

I was able to find the documentation for the built-in full-text search
parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
which can parse URLs and extract the hostname. For instance, I can
extract the hostname from a URL as follows:

        SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

            token
        -------------
         www.foo.com
        (1 row)

However, I can't figure out how to integrate this into the query above
to filter out duplicate domains from the results. And because this
appears in the docs under "testing and debugging text search
<https://www.postgresql.org/docs/11/textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING>",
I don't know whether this use of `ts_parse()` is even how the URL
parser is intended to be used in practice.
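
To make the question concrete, here is the kind of integration I have
been imagining: a LATERAL subquery to pull the host token out of each
target URL, plus DISTINCT ON to keep only the lowest-scored row per
host. This is an untested sketch on my part (the LATERAL join and the
hard-coded token id 6 for "host" are my own assumptions), so I don't
know if this is the intended approach:

        WITH top_results AS (
            SELECT page_id, score FROM crawl_results
            WHERE crawl_id = $1
            ORDER BY score LIMIT 100
        )
        SELECT DISTINCT ON (host.token)
               top_results.score, l.target
        FROM top_results
            JOIN pages p ON top_results.page_id = p.id
            JOIN links l ON p.url = l.source
            -- extract the "host" token (tokid = 6 in the default parser)
            -- from each target URL
            CROSS JOIN LATERAL (
                SELECT token FROM ts_parse('default', l.target)
                WHERE tokid = 6
                LIMIT 1
            ) AS host
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target = pp.url)
        -- DISTINCT ON keeps the first row per host, i.e. the lowest score
        ORDER BY host.token, top_results.score;

One thing I'm unsure about: the CROSS JOIN would silently drop any link
whose target yields no host token; presumably a LEFT JOIN LATERAL ... ON
true would keep those rows instead.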

Is a sketch like the above the right way to use the "host" parser to
return one row per domain, or is there a more idiomatic approach? Also,
how would I appropriately index the "links" table for "host" and "url"
token lookup?

Thanks!
