Re: Performance on inserts

From: Jules Bean <jules(at)jellybean(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alfred Perlstein <bright(at)wintelcom(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance on inserts
Date: 2000-08-26 10:48:58
Message-ID: 20000826114855.A1524@pear.presence.net.uk
Lists: pgsql-hackers

On Fri, Aug 25, 2000 at 07:00:22PM -0400, Tom Lane wrote:
> The algorithm does seem to work quite nicely just the way I described
> it, although it turns out I was off base about a good probability
> setting. I find that something up around 0.99 seems to be good.
> Using the same (perhaps overly simplistic) test case:
>
> # tuples     6.5                           current+random hack @ 0.99
> inserted     Time (sec)  index size (B)    Time (sec)  index size (B)
> 1536         <1          90112             <1          106496
> 3072         1.56        163840            <1          188416
> 6144         3.70        286720            1.40        376832
> 12288        9.73        532480            2.65        688128
> 24576        93.26       1024000           5.22        1368064
> 49152        363.23      2007040           10.34       2727936
> 98304        --          --                22.07       5545984
> 196608       --          --                45.60       11141120
> 393216       --          --                92.53       22290432
>
> I tried probabilities from 0.67 to 0.999 and found that runtimes didn't
> vary a whole lot (though this is near the minimum), while index size
> consistently got larger as the probability of moving right decreased.
> The runtime is nicely linear throughout the range.

That looks brilliant!! (Bearing in mind that I have over 10 million
tuples in my table, you can imagine what performance was like for me!)
Is there any chance you could generate a patch against the released
7.0.2 to add just this functionality? It would be the kiss of life for
my code!

(Not in a hurry, I'm not back in work until Wednesday, as it happens)
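
For the archives, here is roughly how I understand the hack (a minimal
sketch with made-up names, not Tom's actual patch): while inserting a
key that already has a long run of duplicates, instead of always
stepping right to the end of the run, step right with probability p
and otherwise settle for the current page if it has room.

    #include <stdlib.h>

    #define MOVE_RIGHT_PROB 0.99    /* Tom found ~0.99 near the minimum */

    /* Placeholder types/functions standing in for the real nbtree code. */
    typedef struct Page Page;

    extern int   page_has_free_space(Page *page, size_t itemsize);
    extern int   key_continues_on_right(Page *page);
    extern Page *step_right(Page *page);

    static Page *
    find_insert_page(Page *page, size_t itemsize)
    {
        /*
         * Walk right while the run of equal keys continues, but stop
         * early, with probability 1 - MOVE_RIGHT_PROB, on any page that
         * can hold the new item.  The random early stop spreads inserts
         * across the whole run instead of piling them all onto the last
         * page, which is what made insert time quadratic before.
         */
        while (key_continues_on_right(page))
        {
            if (page_has_free_space(page, itemsize) &&
                ((double) rand() / RAND_MAX) > MOVE_RIGHT_PROB)
                break;
            page = step_right(page);
        }
        return page;
    }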

And, of course, what would /really/ get my code going speedily would
be the partial indices mentioned elsewhere in this thread. If the
backend could automagically drop keys matching > 10% (tunable) of the
rows from the index, then my index would be (a) about 70% smaller and
(b) only consulted when it is actually faster than a sequential scan.
[This means the backend would have to maintain some simple histogram
data, but I can't see that being much of an overhead. A sketch of what
I have in mind follows.]
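
Nothing like this exists in the backend today, and all names below are
made up, but the gating test on the insert path could be as simple as:

    #define COMMON_KEY_FRACTION 0.10        /* the tunable "> 10%" above */

    typedef struct KeyStats
    {
        double ntuples;                 /* total rows in the table */
        double nwithkey;                /* estimated rows with this key */
    } KeyStats;

    /* Imaginary lookup into the simple histogram the backend would keep. */
    extern KeyStats lookup_key_stats(const char *column, long key);

    static int
    should_index_key(const char *column, long key)
    {
        KeyStats st = lookup_key_stats(column, key);

        /* Index the key only if it is rare enough for the index to win. */
        return (st.nwithkey / st.ntuples) <= COMMON_KEY_FRACTION;
    }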

For the short term, if I can get a working version of the above
randomisation patch, I think I shall 'fake' a partial index by
manually setting 'enable_seqscan=off' for all but the 4 or 5 most
common categories. Those two factors combined will speed up my bulk
inserts a lot.

One other idea, though:

Is there any simple way for Pg to combine inserts into one bulk
operation, specifically their effect on the index files? It has always
seemed to me to be one of the (many) glaring flaws in SQL that the
INSERT statement only takes one row at a time. But with
INSERT ... SELECT, I can imagine that it might be possible to do
'bulk' index updating, so that the scanning process is done once per
'batch'.

To make an analogy with sorted files (which indices are rather like):
if I wanted to add another 100 lines to a 1000-line sorted file, I'd
sort the 100 first and then 'merge' them in. Whilst I realise that
indices aren't stored as flat sorted files (no need), I think it ought
to be possible to construct an efficient algorithm for merging two
btrees. A sketch of the file analogy follows.
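
Something like this is what I mean (a minimal sketch in plain C, with
arrays of longs standing in for index entries; a real btree merge
would obviously work page by page rather than over flat arrays):

    #include <stdlib.h>

    static int
    cmp_long(const void *a, const void *b)
    {
        long la = *(const long *) a;
        long lb = *(const long *) b;

        return (la > lb) - (la < lb);
    }

    /* Merge sorted 'run' and a new 'batch' into 'out' in one linear pass. */
    static void
    merge_batch(const long *run, size_t nrun,
                long *batch, size_t nbatch, long *out)
    {
        size_t i = 0, j = 0, k = 0;

        /* Sort the incoming batch first, as with the 100-line file. */
        qsort(batch, nbatch, sizeof(long), cmp_long);

        while (i < nrun && j < nbatch)
            out[k++] = (run[i] <= batch[j]) ? run[i++] : batch[j++];
        while (i < nrun)
            out[k++] = run[i++];
        while (j < nbatch)
            out[k++] = batch[j++];
    }

With the batch sorted up front, each existing page need only be
visited once, which is exactly the once-per-batch scan I was hoping
for above.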

Jules

--
Jules Bean | Any sufficiently advanced
jules(at)debian(dot)org | technology is indistinguishable
jules(at)jellybean(dot)co(dot)uk | from a perl script
