Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont <joelr1(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: stored proc and inserting hundreds of thousands of rows
Date: 2011-04-30 22:29:25
Message-ID: 76A45DA2-2AC3-4F58-933B-D4DCD61B8DCE@gmail.com
Lists: pgsql-performance


On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting? The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that holds the probability array.

I'm inserting document id and ad id pairs to record that a given ad is now linked to a given document. The mapping table has a primary key on the serial document id.
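
For concreteness, the shape is something like this (doc_ad_map and the other names are illustrative, not my exact DDL):

    -- mapping table: one row per <doc,ad> pair
    CREATE TABLE doc_ad_map (
        doc_id int NOT NULL,  -- the serial document id
        ad_id  int NOT NULL,
        PRIMARY KEY (doc_id, ad_id)
    );

    -- matching a new ad (id 42, say) against existing documents
    INSERT INTO doc_ad_map (doc_id, ad_id)
    SELECT d.id, 42
    FROM docs d;  -- threshold filter omitted for now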

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold. But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply the threshold to cut down the number of rows.

> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
> anything. Are you sure the parameter shouldn't be <=0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind if this process took a
> couple minutes, as long as it didn't interfere with other things going
> on in the database. So you probably need to describe what those other
> things going on in the database are.

Those other things are ad serving, which boils down to a lookup of the ad ids linked to a document.

This is a primary-key lookup against the mapping table, and it goes on at the same time as large numbers of <doc,ad> mappings are being inserted into that same table.
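
In other words, something like this (same illustrative names as above):

    -- ad serving: primary-key lookup of the ads linked to one document
    SELECT ad_id
    FROM doc_ad_map
    WHERE doc_id = $1;

This runs concurrently with the bulk INSERT ... SELECT above.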

Documents are uploaded into the system at a rate of 10k per day, roughly one every few seconds. I wish I could avoid storing the <doc,ad> mapping, as that table is going to grow absolutely huge when each new ad matches tens or hundreds of thousands of documents.

I don't think I can do the matching when serving an ad, though, as I would still need to scan millions of probability vectors (one per document) to calculate the distance between the current document and the existing ads.

Then again, the number of ads in the system will always be a fraction of the number of documents, so perhaps the matching of documents to ads can be done at runtime.

> Also, you might have a data correctness problem. If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it.

As new ads are entered, they need to be matched with existing documents.

As new documents are entered, they need to be matched with existing ads.

Both ads and docs are represented by probability vectors of 150 floats, so it's the same distance calculation either way.
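
The metric itself isn't the point here, but to illustrate the shape of the computation, a plain Euclidean distance over two float8[] vectors could be written like this (my real query uses an equivalent inline expression):

    CREATE OR REPLACE FUNCTION vec_distance(a float8[], b float8[])
    RETURNS float8 LANGUAGE sql IMMUTABLE AS $$
        -- pair up the two arrays element by element, then aggregate;
        -- both arrays are assumed to have the same length (150 here)
        SELECT sqrt(sum((p.va - p.vb) ^ 2))
        FROM (SELECT unnest(a) AS va, unnest(b) AS vb) AS p;
    $$;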

> then if new rows are added to each table during overlapping transactions, the
> new ads against new docs comparison will not actually happen. You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!
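
One option I can think of is taking a lock that covers a whole matching pass, something like this (untested sketch, table names illustrative as before):

    BEGIN;
    -- SHARE ROW EXCLUSIVE conflicts with itself and with ordinary
    -- inserts, so the two matching passes serialize against each other
    -- and an ad and a doc added in overlapping transactions can't
    -- both slip past the comparison
    LOCK TABLE docs, ads IN SHARE ROW EXCLUSIVE MODE;
    -- ... compute distances, INSERT INTO doc_ad_map ...
    COMMIT;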

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------
