Re: stored proc and inserting hundreds of thousands of rows

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Joel Reymont <joelr1(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: stored proc and inserting hundreds of thousands of rows
Date: 2011-05-01 01:00:34
Message-ID: BANLkTi=ML2MbgEP-pRdAsHaCFp9s3w+DWA@mail.gmail.com
Lists: pgsql-performance

On Sat, Apr 30, 2011 at 5:12 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>
>
> gist indices are designed to make this type of thing fast, by using
> techniques to rule out most of those comparisons without actually
> performing them. I don't know enough about the
> guts of either your distance function or the gist indexes to know if
> you can do it this way, but if you can it would certainly be the way
> to go.
>

It is definitely a good idea to consider a GiST index for eliminating most
of a large dataset, if applicable. Do a little reading on the topic and,
hopefully, its applicability (or not) will become apparent.
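As a rough illustration of the idea, here is a minimal sketch (table and column names are hypothetical, and this assumes the distance function can be approximated by a bounding-box predicate that a GiST opclass supports):

```sql
-- Hypothetical table of ads, each with a 2-D feature point.
CREATE TABLE ads (id serial PRIMARY KEY, features point);
CREATE INDEX ads_features_gist ON ads USING gist (features);

-- A containment predicate like this can be satisfied by the GiST index,
-- ruling out most rows without evaluating an exact distance function:
SELECT id FROM ads
WHERE features <@ box '((-1,-1),(1,1))';
```

Whether your actual distance function can be bounded this way is the question to answer from the docs.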

However, as someone who has built a number of ad servers over the years for
several of the larger ad networks, the first thing I'd do is separate your
ad serving from real-time interaction with your database, no matter what the
underlying technology. If your ad serving is dependent on your database,
hiccups in the database will have an impact on ad serving, which is rarely
tolerable. And god forbid you should need to take the db down for a period
of maintenance. The reliability and performance required of most ad servers
is greater than what should reasonably be expected of a relational database,
particularly if there are other processes accessing the database, as is the
case with your system. The first rule of ad serving is that no outage of
backend systems should ever be able to prevent or impact front-end ad
serving.

Some kind of in-memory cache of doc/ad mappings which the ad server
interacts with will serve you in good stead and will be much easier to scale
horizontally than most relational db architectures lend themselves to. If
you have an ever-increasing set of documents and ads, you'll inevitably wind
up 'sharding' your dataset across multiple db hosts in order to maintain
performance, which creates a number of maintenance complexities. It is
easier to maintain a single database and do analytics over a single data
source, but insulate it from the real-time performance requirements of your
ad serving. Even something as simple as a process that pushes the most
recent doc/ad mappings into a memcache instance could be sufficient, and you
can scale your memcache across as many hosts as necessary to deliver the
lookup latencies you require, no matter how large the dataset.

Similarly, if you are updating the database from the ad server with each ad
served in order to record an impression or click, you'll be far better off
logging those and then processing the logs in bulk on a periodic basis. If
subsequent impressions depend on what has already been served historically,
then use your memcache instance (or whatever structure you eventually
choose) to handle those lookups. This gives you the convenience and
flexibility of a relational system with SQL for access, but without the
capabilities of a single host limiting the real-time performance of the
system as a whole.
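The serve-from-cache and log-then-batch pattern above can be sketched as follows (a minimal stand-in, not any real ad server's code; the class and method names are illustrative, and a plain dict stands in for a memcache-style store):

```python
import time
from collections import Counter

class AdCache:
    """Illustrative in-memory stand-in for a memcache-style store of
    doc -> ad-id mappings, refreshed periodically from the database."""

    def __init__(self):
        self._mappings = {}   # doc_id -> list of ad ids
        self._log = []        # (timestamp, ad_id) impression log

    def refresh(self, mappings):
        """Push the latest doc/ad mappings (e.g. from a periodic DB query)."""
        self._mappings = dict(mappings)

    def ads_for(self, doc_id):
        """Serve ads from the cache; never touches the database."""
        return self._mappings.get(doc_id, [])

    def log_impression(self, ad_id):
        """Record an impression locally instead of one DB write per ad served."""
        self._log.append((time.time(), ad_id))

    def drain_log(self):
        """Periodic bulk step: aggregate impression counts for a single
        batched database write, then clear the local log."""
        counts = Counter(ad_id for _, ad_id in self._log)
        self._log.clear()
        return counts
```

The point of the shape is that `ads_for` and `log_impression` are pure in-memory operations on the serving path, while `refresh` and `drain_log` are the only places that ever need to talk to the database, on whatever schedule you choose.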
