Re: How to use full-text search URL parser to filter query results by domain name?

From: Jess Wren <jess(dot)wren(at)interference(dot)cc>
To: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How to use full-text search URL parser to filter query results by domain name?
Date: 2019-04-10 08:57:22
Message-ID: 9c7e4b2e-6b0b-70b4-f96e-784a1803a5eb@interference.cc
Lists: pgsql-general

On 4/8/19 4:50 AM, Arthur Zakirov wrote:
> I think it is normal to use ts_parse(). And I suppose you might use
> window functions.
>
> For example, you have table links:
>
> =# create table links (score int, link text);
> =# insert into links values
>   (1, 'http://www.foo.com/bar'),
>   (2, 'http://www.foo.com/foo'),
>   (2, 'http://www.bar.com/foo'),
>   (1, 'http://www.bar.com/bar');
>
> You can use the following query:
>
> =# with l as (
>   select score, token, link,
>     rank() over (partition by token order by score) as rank
>   from links,
>     lateral ts_parse('default', link)
>   where tokid = 6)
> select score, token, link from l where rank = 1;
>  score |    token    |          link
> -------+-------------+------------------------
>      1 | www.bar.com | http://www.bar.com/bar
>      1 | www.foo.com | http://www.foo.com/bar
>

Thank you very much Arthur. Your suggestion led me to a query that at
least returns a correct result set. I could not figure out how to get
your rank() function to work with my query, but building on your answer
(and others from IRC etc.), I ended up with the following solution:

First I created the following views:

CREATE VIEW scored_pages AS (
    SELECT crawl_results.crawl_id, crawl_results.score,
           crawl_results.page_id, pages.url
    FROM crawl_results
    JOIN pages ON crawl_results.page_id = pages.id
);

CREATE VIEW scored_links AS (
    SELECT scored_pages.score, links.source,
           links.target, links.link_text
    FROM links
    JOIN scored_pages ON scored_pages.url = links.source
);

Then, using these views, I did the following query to extract the links
from the lowest scored pages in the results:

SELECT score, host, target
FROM (
    SELECT DISTINCT ON (token)
           token AS host, score, target
    FROM scored_links,
         LATERAL ts_parse('default', target)
    WHERE tokid = 6
    ORDER BY token, score
) AS x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target = pp.url)
ORDER BY score;
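(A note on the magic number for anyone following along: the `tokid = 6` filter keeps only the `host` tokens emitted by the default parser. The mapping can be checked with ts_token_type():

```sql
-- Show the token type that tokid 6 corresponds to in the default parser;
-- it should be the "host" token type.
SELECT * FROM ts_token_type('default') WHERE tokid = 6;
```
)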

Does this seem like a reasonable approach? When running EXPLAIN on this
query, I get the following:

                                                    QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
   Sort Key: crawl_results.score
   ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
         Hash Cond: ((links.target)::text = (pp.url)::text)
         ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
               ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                     Sort Key: ts_parse.token, crawl_results.score
                     ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                           Workers Planned: 2
                           ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                 Hash Cond: ((links.source)::text = (pages.url)::text)
                                 ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                       ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                       ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                             Filter: (tokid = 6)
                                 ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                       ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                             Hash Cond: (crawl_results.page_id = pages.id)
                                             ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                             ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                   ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
         ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
               ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
(23 rows)

I am wondering if there is a more efficient way to do this. Some people
on IRC suggested declaring a scalar function that returns the host from
ts_parse() instead of using the LATERAL query, but I couldn't figure out
how to write one, or whether it would even be preferable to the above
from a performance standpoint. Any ideas on how I could improve this
query?
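The closest I got to the scalar-function idea was something like the following untested sketch (the name `url_host` is mine, not something anyone suggested):

```sql
-- Hypothetical helper: return the host token of a URL, or NULL if the
-- default parser finds none. Declaring it IMMUTABLE would also allow
-- building an expression index on url_host(target) later.
CREATE OR REPLACE FUNCTION url_host(url text) RETURNS text AS $$
    SELECT token FROM ts_parse('default', url) WHERE tokid = 6 LIMIT 1;
$$ LANGUAGE sql IMMUTABLE STRICT;
```

With that, I imagine the inner query could drop the LATERAL and become `SELECT DISTINCT ON (url_host(target)) url_host(target) AS host, score, target FROM scored_links ORDER BY url_host(target), score` — but I don't know if the planner handles that any better.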
