Re: faster INSERT with possible pre-existing row?

From: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>
To: John A Meinel <john(at)arbash-meinel(dot)com>
Cc: newz(at)bearfruit(dot)org, Dan Harris <fbsd(at)drivefaster(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: faster INSERT with possible pre-existing row?
Date: 2005-07-26 22:29:23
Message-ID: 1122416963.17503.25.camel@archimedes
Lists: pgsql-performance

Easier and faster than doing the custom trigger is to simply define a
unique index and let the DB enforce the constraint with an index lookup,
something like:

CREATE UNIQUE INDEX happy_index ON happy_table (col1, col2, col3);

That should run faster than the custom trigger, but not as fast as the
temp table solution suggested elsewhere because it will need to do an
index lookup for each row. With this solution, it is important that
shared_buffers is set high enough that happy_index can be kept in
memory; otherwise performance will drop precipitously. Also, if you
are increasing the size of the table by a large percentage, you will
want to ANALYZE periodically, as an optimal plan for a small table may
be a disaster for a large table, and PostgreSQL won't switch plans
unless you run ANALYZE.
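
For example, once the index exists the load can simply attempt each
INSERT and let the database reject the duplicates. A minimal sketch of
one way to swallow the resulting errors (the function name is made up,
the columns follow the index above, and the EXCEPTION clause needs
PostgreSQL 8.0 or later):

CREATE OR REPLACE FUNCTION insert_happy(v1 int, v2 int, v3 int)
RETURNS void AS $$
BEGIN
    -- the unique index does the duplicate check for us
    INSERT INTO happy_table (col1, col2, col3) VALUES (v1, v2, v3);
EXCEPTION WHEN unique_violation THEN
    -- the row is already there, so just skip it
    NULL;
END;
$$ LANGUAGE plpgsql;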

-- Mark

On Tue, 2005-07-26 at 14:51 -0500, John A Meinel wrote:
> Matthew Nuzum wrote:
> > On 7/26/05, Dan Harris <fbsd(at)drivefaster(dot)net> wrote:
> >
> >>I am working on a process that will be inserting tens of millions of rows
> >>and need this to be as quick as possible.
> >>
> >>The catch is that for each row I could potentially insert, I need to
> >>look and see if the relationship is already there to prevent
> >>multiple entries. Currently I am doing a SELECT before doing the
> >>INSERT, but I recognize the speed penalty in doing two operations. I
> >>wonder if there is some way I can say "insert this record, only if it
> >>doesn't exist already". To see if it exists, I would need to compare
> >>3 fields instead of just enforcing a primary key.
> >
> >
> > I struggled with this for a while. At first I tried stored procedures
> > and triggers, but it took very long (over 24 hours for my dataset).
> > After several iterations of rewriting it, first into C# then into
> > Python I got the whole process down to under 30 min.
> >
> > My scenario is this:
> > I want to normalize log data. For example, for the IP address in a log
> > entry, I need to look up the unique id of the IP address, or if the IP
> > address is new, insert it and then return the newly created entry.
> > Multiple processes use the data, but only one process, run daily,
> > actually changes it. Because this one process knows that the data is
> > static, it selects the tables into in-memory hash tables (C#) or
> > Dictionaries (Python) and then does the lookups there. It is *super*
> > fast, but it uses a *lot* of ram. ;-)
> >
> > To limit the ram, I wrote a version of the python code that uses gdbm
> > files instead of Dictionaries. This requires a newer version of Python
> > (to allow a gdbm db to work just like a dictionary) but makes life
> > easier in case someone is using my software on a lower end machine.
> > This doubled the time of the lookups from about 15 minutes to 30,
> > bringing the whole process to about 45 minutes.
> >
>
> Did you ever try the temp table approach? You could:
>
> COPY all records into temp_table, leaving ip_id empty (NULL)
> -- Get any entries which already exist
> UPDATE temp_table SET ip_id =
>     (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
>     WHERE EXISTS
>         (SELECT 1 FROM ipaddress WHERE ipaddress.add = temp_table.add);
> -- Create new entries
> INSERT INTO ipaddress(add) SELECT DISTINCT add FROM temp_table
>     WHERE ip_id IS NULL;
> -- Update the rest
> UPDATE temp_table SET ip_id =
>     (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
>     WHERE ip_id IS NULL AND EXISTS
>         (SELECT 1 FROM ipaddress WHERE ipaddress.add = temp_table.add);
>
> This would let the database do all of the updating work in bulk on its
> side, rather than you pulling all the data out and doing it locally.
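
(Concretely, the setup John describes might look roughly like this; an
untested sketch, assuming an ipaddress(ip_id serial, add text) table and
data arriving on COPY's standard input:)

CREATE TEMP TABLE temp_table (add text, ip_id integer);
-- load the raw addresses; ip_id stays NULL until the UPDATE and INSERT
-- statements given above are run
COPY temp_table (add) FROM STDIN;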
>
> An alternative would be something like:
>
> CREATE TEMP TABLE new_ids (address text, ip_id int);
> COPY all potentially new addresses into that table.
> -- Delete all entries which already exist
> DELETE FROM new_ids WHERE EXISTS
> (SELECT ip_id FROM ipaddresses
> WHERE add=new_ids.address);
> -- Now create the new entries
> INSERT INTO ipaddresses(add) SELECT DISTINCT address FROM new_ids;
>
> -- At this point you are guaranteed to have all addresses existing in
> -- the database
>
> If you then insert your full data into the final table, leaving only the
> ip_id column null, and you have a partial index on the rows where ip_id
> is NULL, you could use the command:
>
> UPDATE final_table SET ip_id =
> (SELECT ip_id FROM ipaddresses WHERE add=final_table.add)
> WHERE ip_id IS NULL;
>
> You could also do this in a temporary table, before bulk inserting into
> the final table.
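
(The partial index John mentions could be created along these lines; just
a sketch, with final_table and its add/ip_id columns taken from the example:)

-- index only the rows that still need an ip_id, so the UPDATE above
-- only has to visit the rows that are not yet filled in
CREATE INDEX final_table_needs_ip_idx ON final_table (add)
    WHERE ip_id IS NULL;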
>
> I don't know what you have tried, but I know that for Dan, he easily has
> more than 36M rows. So I don't think he wants to pull that locally and
> create an in-memory hash just to insert 100 rows or so.
>
> Also, for your situation, if you do keep a local cache, you could
> certainly save the cache between runs, and use a temp table to determine
> what new ids you need to add to it. Then you wouldn't have to pull the
> complete set each time. You just pull new values for entries you haven't
> added yet.
>
> John
> =:->
