Re: Performance on inserts

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jules Bean <jules(at)jellybean(dot)co(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance on inserts
Date: 2000-08-25 15:13:08
Message-ID: 2378.967216388@sss.pgh.pa.us
Lists: pgsql-hackers

Jules Bean <jules(at)jellybean(dot)co(dot)uk> writes:
> So why is updating the huge (2G) unique index on (name,cat) not too
> much of a problem, but updating the small (600M) non-unique index on
> (cat) sufficient to increase speed by around two orders of magnitude?

> The real slow-down is noticeable when I'm doing a bulk insert where
> all new rows belong to the most popular category. I know that some
> btree implementations don't behave particularly sanely with several
> million rows in a single key.. is the btree implementation used too
> slow in this case?

Indeed, the btree code used in 7.0 and before is pretty awful in the
face of large numbers of equal keys. I have improved it somewhat,
but it's still got a problem once there are enough equal keys to span
many index pages. I just did this experiment:

create table fill(f1 text);
create index filli on fill(f1);
insert into fill values('foo');
insert into fill values('bar');
insert into fill values('baz');

insert into fill select * from fill; -- repeat this many times

The runtime of the insert/select scales in a pretty horrid fashion:

# tuples inserted    6.5 (sec)    current (sec)

            1536     <1           <1
            3072     1.56         <1
            6144     3.70         1.84
           12288     9.73         4.07
           24576     93.26        38.72
           49152     363.23       231.56

At the end of this process we have about 100 pages of index entries
for each of the three distinct key values, with about 330 items per page.
The fixes that I applied a month or two back improve behavior for
multiple equal keys within a page, but they do little for multiple
pages full of the same key value. Here's the problem: the code for
INSERT starts by finding the first occurrence of the target key (using
btree descent, so that's fast enough). But then it does this:

    /*
     * If we will need to split the page to put the item here,
     * check whether we can put the tuple somewhere to the right,
     * instead. Keep scanning until we find enough free space or
     * reach the last page where the tuple can legally go.
     */
    while (PageGetFreeSpace(page) < itemsz &&
           !P_RIGHTMOST(lpageop) &&
           _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
    {
        /* step right one page */
        BlockNumber rblkno = lpageop->btpo_next;

        _bt_relbuf(rel, buf, BT_WRITE);
        buf = _bt_getbuf(rel, rblkno, BT_WRITE);
        page = BufferGetPage(buf);
        lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
    }

In other words, we do a linear scan over all the pages containing equal
key values, in the hope of finding one where we can shoehorn in the new
entry. If the keys are roughly equal-sized, that hope is usually vain,
and we end up splitting the rightmost page that contains any keys
equal to the target key. Subsequent inserts of the same key value fill
first the left-half split page, then the right-half, then split the
right-half page again and repeat --- but each insert first has to scan
over all the preceding pages containing equal keys. That means
inserting N equal keys takes O(N^2) time, for large enough N.
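
To put rough numbers on it: at about 330 items per page, inserting the
k-th tuple of a given key value means stepping across roughly k/330 full
pages first, so N inserts of the same key cost on the order of N^2/660
page visits in total. By the end of the experiment above that is well
over a million buffer touches per key.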

After thinking about this for a while, I propose a probabilistic
solution. Suppose that we add to the above loop a random-number check
that causes us to fall out of the loop with probability 1/3 or so;
that is, if the current page is full of the target key, then we move
right to check the next page with probability 2/3, but with probability
1/3 we give up and split the current page to insert the new key.

With this approach we'd hardly ever move more than, say, 10 pages before
splitting, so the time to insert a new key is bounded no matter how many
duplicates it has.
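
Concretely, the loop shown above would gain a random early exit, along
these lines (untested sketch; the 1/3 constant and the bare random()
call are just placeholders for whatever we end up choosing):

    while (PageGetFreeSpace(page) < itemsz &&
           !P_RIGHTMOST(lpageop) &&
           _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0)
    {
        BlockNumber rblkno = lpageop->btpo_next;

        /*
         * With probability 1/3, give up and split the current page
         * rather than continuing the rightward scan for free space.
         */
        if (random() % 3 == 0)
            break;

        /* otherwise step right one page, as before */
        _bt_relbuf(rel, buf, BT_WRITE);
        buf = _bt_getbuf(rel, rblkno, BT_WRITE);
        page = BufferGetPage(buf);
        lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
    }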

The reason for using a random decision is that if we always stopped
after scanning a fixed number of pages, then the space freed up by the
previous decision to split would always be just out of reach, and we'd
end up with lots of half-full pages of the same key. With a random
decision we will visit and fill both the left and right halves of a
previously split page.
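
(For instance, if we always gave up after exactly ten pages, then page
ten would be split over and over: each new right half would land at
position eleven, just past where we stop looking, and would be left
sitting there half full.)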

Comments? Is there a better way? What's the best probability to use?

regards, tom lane
