Re: stored proc and inserting hundreds of thousands of rows

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Joel Reymont <joelr1(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: stored proc and inserting hundreds of thousands of rows
Date: 2011-04-30 22:11:56
Message-ID: BANLkTikrXYJpJGzsUtpFgS8YiAO+d7y6PA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont <joelr1(at)gmail(dot)com> wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting? The queries you reported below
are not the same as the ones you originally described.

In particular, they do not seem to use the "threshold" parameter that
the original ones did, whose job is presumably to cut the 2 million
down to a much smaller number that meet the threshold. But how much
smaller is that number? This will have a large effect on how long the
insert takes.

...

> Digging deeper into the distance function,
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
>  Output: tab.id, tab.distance
>  Filter: (tab.distance <= 50::double precision)
>  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002 loops=1)
>        Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
anything. Are you sure the parameter shouldn't be <=0.50 instead?

Also, you previously said you didn't mind of this process took a
couple minutes, as long as it didn't interfere with other things going
on in the database. So you probably need to describe what those other
things going on in the database are.

Also, you might have a data correctness problem. If the plan is to
scan new ads against all docs, and new docs against all ads; then if
new rows are added to each table during overlapping transaction, the
new ads against new docs comparison will not actually happen. You
will probably need to add manual locking to get around this problem.

Cheers

Jeff

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joel Reymont 2011-04-30 22:29:25 Re: stored proc and inserting hundreds of thousands of rows
Previous Message Joel Reymont 2011-04-30 21:22:24 Re: stored proc and inserting hundreds of thousands of rows