Re: slow speeds after 2 million rows inserted

From: "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com>
To: "Nikola Milutinovic" <alokin1(at)yahoo(dot)com>
Cc: "PostgreSQL general" <pgsql-general(at)postgresql(dot)org>
Subject: Re: slow speeds after 2 million rows inserted
Date: 2006-12-31 17:08:12
Message-ID: 81961ff50612310908q71d9a81bx21264612423dd04d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 12/31/06, Nikola Milutinovic <alokin1(at)yahoo(dot)com> wrote:
>
> > 1. There is no difference (speed-wise) between committing every 1K or
> every 250K rows.
>
> It was really some time ago, since I have experimented with this. My las
> experiment was on PG 7.2 or 7.3. I was inserting cca 800,000 rows.
> Inserting without transactions took 25 hrs. Inserting with 10,000 rows per
> transaction took about 2.5 hrs. So, the speedup was 10x. I have not
> experimented with the transaction batch size, but I suspect that 1,000 would
> not show much speedup.
>

I would imagine the overhead here is sending the BEGIN/COMMIT (or the
behavior could be version dependent), and the commit is forcing a log
flush. According to the documentation multiple inserts have the potential
of being flushed in one shot when the database wakes up to do a logflush
automatically, so committing more frequently would actually appear to slow
you down.

The time to commit is dependent on hardware, on my hardware it was around
40ms (which you have to question the precision of the calculation, clearly
it is probably less than 40ms and that is an upper limit under load).

My experiment was with 8.2, default configuration, so there is room for
improvement. It was installed from Devrim's RPM packages.

> 2. Vacuuming also makes no difference for a heavy insert-only table, only
> slows it down.
>
> Makes sense. Since my application was dumping all records each month and
> inserting new ones, vacuum was really needed, but no speedup.
>

I agree, vacuuming is clearly important. I would also think if you are
going to do a massive one-time update/delete as a maintenance item on a
normally read only table that you should plan on doing a vacuum full to
recover the space used by the "old" rows. Logically the fewer pages on
disk, the less I/O that will result in scenarios where you are doing
sequential scans and probably even many index scans.

It seems that the MVCC implementation would introduce fragmentation (with
respect to a btree indexed field) if your table design had an indexed
creation date field and you often range scanned on that field but also
updated the record then over time the optimizer would less favor the index
as the correlation approached 0. Obviously this is a great feature for a
"last update date" field. :) Not so great if your primary queries are on a
creation date field.

> 3. Table size plays no real factor.
>
> The reason I saw speedup, must have to do with the fact that without
> transactions, each insert was it's own transaction. That was eating
> resources.
>

I would agree. I am also reusing the same statement handles (prepare once,
execute many) with DBD::Pg. The benefit here appears to be that it prepares
the cursor once (one time to parse and generate the execution plan), and
executes the same plan multiple times. The difference in inserts was about
2000 inserts/s!

This is the one of the reasons why everyone keeps saying use COPY instead of
INSERT, COPY is essentially a one time prepare and execute many.

Test #1 (prepare once, execute many):
4000 inserts 0.92 secs, 4368.84 inserts/s, commit 0.04 secs.
4000 inserts 0.93 secs, 4303.47 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4319.78 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4306.38 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4299.53 inserts/s, commit 0.02 secs.
4000 inserts 0.92 secs, 4345.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4298.67 inserts/s, commit 0.03 secs.
4000 inserts 0.91 secs, 4382.13 inserts/s, commit 0.04 secs.
4000 inserts 0.92 secs, 4347.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4314.66 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 28.7435
inserts/s (Mean) = 4328.6351

Test #2 (prepare/execute each time):
4000 inserts 1.92 secs, 2086.21 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2092.67 inserts/s, commit 0.02 secs.
4000 inserts 1.91 secs, 2094.54 inserts/s, commit 0.05 secs.
4000 inserts 1.96 secs, 2042.55 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.57 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2098.91 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.38 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2083.94 inserts/s, commit 0.02 secs.
4000 inserts 1.95 secs, 2050.07 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2086.14 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 19.2360
inserts/s (Mean) = 2083.1987

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2006-12-31 17:58:40 Re: Generic timestamp function for updates where field
Previous Message Richard Broersma Jr 2006-12-31 17:00:50 Re: slow speeds after 2 million rows inserted