Re: Duplicate deletion optimizations

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Marc Eberhard <eberhardma(at)googlemail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-06 20:38:30
Message-ID: CAEV0TzDgzktQaqvb+8h_WDibk3V==q+6Syf=vM4J6vtSke7M3g@mail.gmail.com
Lists: pgsql-performance

On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma(at)googlemail(dot)com> wrote:

> Hi Samuel!
>
> On 6 January 2012 20:02, Samuel Gendler <sgendler(at)ideasculptor(dot)com> wrote:
> > Have you considered doing the insert by doing a bulk insert into a temp
> > table and then pulling rows that don't exist across to the final table in
> > one query and updating rows that do exist in another query? I did a very
> > brief scan of the SO thread and didn't see it suggested. Something like
> > this:
> >
> > update stats_5mn set count = count + t.count
> > from temp_table t
> > where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record
> > and stats_5mn.output_id = t.output_id;
> >
> > insert into stats_5mn
> > select * from temp_table t
> > where not exists (
> > select 1 from stats_5mn s
> > where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
> > t.output_id
> > );
> >
> > drop table temp_table;
>
> Am I right to assume that the update/insert needs to be placed into a
> begin / end transaction block if such batch uploads might happen
> concurrently? Doesn't seem to be the case for this question here, but
> I like the solution and wonder if it works under more general
> circumstances.
>

Yes, assuming you are concerned about making the insertion atomic.
Obviously, a failure in the second query after the first had succeeded
would be problematic outside of a transaction, since any attempt to repeat
the entire operation would apply the updates a second time.

> What's the overhead of creating and dropping a temporary table? Is it
> only worth doing this for a large number of inserted/updated elements?
> What if the number of inserts/updates is only a dozen at a time for a
> large table (>10M entries)?
>

Pretty minimal, but enough that doing a handful of rows at a time probably
wouldn't be worth it. You'd surely get index usage on a plain insert in
such a case, so for small numbers of rows I'd probably just use an upsert
stored proc - unless you are doing large numbers of inserts overall, just a
few at a time, in which case I'd try to accumulate them and then do them in
bulk. Those are tough questions to answer without a specific context. My
real answer is 'try it and see' - you'll always get an answer that is
specific to your exact circumstance that way.
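
For completeness, here is a rough sketch of the sort of upsert function I
mean, along the lines of the example in the PL/pgSQL docs (the column
types are guesses, and it assumes a unique constraint on (t_value,
t_record, output_id)):

create or replace function upsert_stats_5mn(
    p_t_value   timestamp,
    p_t_record  timestamp,
    p_output_id integer,
    p_count     integer) returns void as $$
begin
    loop
        -- try the update first; if the row is already there we're done
        update stats_5mn set count = count + p_count
         where t_value = p_t_value
           and t_record = p_t_record
           and output_id = p_output_id;
        if found then
            return;
        end if;
        -- no existing row: try to insert; if another session gets in
        -- first, the unique violation sends us back around the loop
        begin
            insert into stats_5mn (t_value, t_record, output_id, count)
            values (p_t_value, p_t_record, p_output_id, p_count);
            return;
        exception when unique_violation then
            -- retry the update
        end;
    end loop;
end;
$$ language plpgsql;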

By the way, there is definitely a difference between creating a temp table
and creating an ordinary table that you drop afterwards. See the postgres
docs on temporary tables for specifics, but many databases treat temp
tables differently from ordinary tables, so it is worth understanding what
those differences are. Temp tables are automatically dropped when the
connection (or, optionally, the transaction) ends. Temp table names are
local to the connection, so multiple connections can each create a temp
table with the same name without conflict, which is convenient. I believe
they can also be created in a specific tablespace on disk, etc.
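
For example, something along these lines (the column definitions are only
illustrative):

begin;

create temporary table temp_table (
    t_value   timestamp,
    t_record  timestamp,
    output_id integer,
    count     integer
) on commit drop;

-- bulk load the batch here (COPY or multi-row INSERT), then run the
-- update/insert pair from above

commit;

With "on commit drop" you don't even have to drop the table yourself; it
disappears when the transaction commits.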
