Quick Links

[jules@jellybean.co.uk: Performance on inserts]

From:	Jules Bean <jules(at)jellybean(dot)co(dot)uk>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	[jules@jellybean.co.uk: Performance on inserts]
Date:	2000-08-25 11:19:04
Message-ID:	20000825121902.A19913@pear.presence.net.uk
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Weird. When I sent this 24 hours ago, it didn't go through to the
list, although I successfully sent a message to the list only an hour
or so before that.... as far as my mail-server can tell it went off in
the right direcion...

----- Forwarded message from Jules Bean <jules(at)jellybean(dot)co(dot)uk> -----

Date: Thu, 24 Aug 2000 11:04:25 +0100
From: Jules Bean <jules(at)jellybean(dot)co(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Performance on inserts
User-Agent: Mutt/1.2i

[Meta: Please advise if this is the wrong list. I think, since this
observation relates to Pg internals, this might be the right one, but
feel free to move it back to -general if I'm wrong; I subscribe to
both in any case]

As some of you will have inferred if you've read my last couple of
posts, I'm working on a database with a structure like this (I'm
abstracting; I'm not unfortunately allowed to show you what my client
is really doing here)

A table called 'things' with two columns 'name' and 'category'. The
pair ('name','category') is a primary key.

There are ~ 10 000 000 rows in the table, and category takes values
from a more-or-less fixed set of ~1000 possibilities. As previously
described the most popular category holds around half the rows, the
next most popular holds nearly half of those left, and most categories
occur very rarely indeed. The median is probably around 1000 (which is
less than the 10 000 you'd expect).

Anyhow, this question isn't about speeding up queries --- we already
have that in the histogram thread. This question is about speeding up
inserts.

My standard configuration is to have a unique index on (name,category)
and a non-unique index on (category). The main table is ~ 2G on disk,
the index on (name,cat) is about the same size, the index on (cat) is
around 0.6G.

In this set-up inserts have dropped to the terrifyingly slow rate of
several hours per 10 000 insertions. This is not adequate to my needs,
I occasionally have to process 1 000 000 insertions or more!

I have several ideas for speeding this up at the SQL level (including
inserting into a temp table and then using INSERT ... SELECT to remove
the overhead of separate inserts) but that's not what I want to talk
about either...

What I did to day, which made a staggering difference, is dropping the
non-unique index on (category). Suddenly I can insert at approx 40 000
insertions per minute, which is fine for my needs!

So why is updating the huge (2G) unique index on (name,cat) not too
much of a problem, but updating the small (600M) non-unique index on
(cat) sufficient to increase speed by around two orders of magnitude?

A possible reason has occurred to me:

The real slow-down is noticeable when I'm doing a bulk insert where
all new rows belong to the most popular category. I know that some
btree implementations don't behave particularly sanely with several
million rows in a single key.. is the btree implementation used too
slow in this case?

I haven't collected all the performance statistics I'd like to have,
due to external time pressures, but I can say that under the new
faster configuration, the insertion process is CPU bound, with disk
access far below the levels the machine is capable of. If I have a chance
I'll collect these stats for the old method too.

Any ideas as to what's going on here appreciated (if not, perhaps it
will point you towards an optimisation you ought to make for 7.1)

Jules

----- End forwarded message -----

--
Jules Bean | Any sufficiently advanced
jules(at)debian(dot)org | technology is indistinguishable
jules(at)jellybean(dot)co(dot)uk | from a perl script

Responses

Re: Performance on inserts at 2000-08-25 15:13:08 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Magnus Hagander	2000-08-25 11:31:51	RE: libpq.dll and VB
Previous Message	Oleg Bartunov	2000-08-25 08:59:50	Re: [7.0.2] problems with spinlock under FreeBSD?