Re: Most efficient way to insert without duplicates

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: François Beausoleil <francois(at)teksol(dot)info>
Cc: Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: Most efficient way to insert without duplicates
Date: 2013-04-17 21:07:23
Message-ID: CAMkU=1zNL+7=HhiTx1O_2arDRws63sWZr0SC1ByggF5fN1WZJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Apr 17, 2013 at 1:19 PM, François Beausoleil
<francois(at)teksol(dot)info>wrote:

>
> Le 2013-04-17 à 14:15, Jeff Janes a écrit :
>
>
> It looks like 12% of the time is being spent figuring out what rows to
> insert, and 88% actually doing the insertions.
>
> So I think that index maintenance is killing you. You could try adding a
> sort to your select so that rows are inserted in index order, or inserting
> in batches in which the batches are partitioned by service_id (which is
> almost the same thing as sorting, since service_id is the lead column)
>
>
This analysis is based on your example, which inserted 7 million rows. But
I just noticed you also said you only have a few thousands rows to insert
per day. So if you make your example better match your use case, perhaps
that analysis would no longer hold.

>
> In that case, partitioning the original table by service_id % N would
> help, since the index would be much smaller, right?
>

Probably not. If you partition the table but do not change your loading
method, then the relevant thing would be the sum of the index sizes over
all partitions, which would be about the same as now.

On the other hand, if you change the method to load the data in batches,
you don't need to partition the table, you just need to align the batches
with the index order. You could use partitioning as a way to do that, but
it is just as easy (or easier) to do so without partitioning.

Once you solve the index maintenance problem, partitioning might help solve
the select part of the deduplication problem, though. You would only need
to check against existing data for the partition into which you already
know the current batch is going to be loaded.

Also, the constraint_exclusion code is usually not smart enough to deal
with constraints that use the modulus (unless the modulus itself appears in
the where clause). You would have to use range partitioning instead.

Cheers,

Jeff

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Marko Kreen 2013-04-17 21:33:56 Re: [GENERAL] currval and DISCARD ALL
Previous Message Mike Roest 2013-04-17 20:45:52 Re: Fetching Server configured port from C Module