Re: [GENERAL] Creation of tsearch2 index is very slow

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Stephan Vollmer <svollmer(at)gmx(dot)de>, pgsql-performance(at)postgreSQL(dot)org
Subject: Re: [GENERAL] Creation of tsearch2 index is very slow
Date: 2006-01-20 21:50:17
Message-ID: 27264.1137793817@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-performance

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes). However,
hemdistsign is *still* 70% of the runtime after doing that. The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:

0.53 30.02 1649/1649 FunctionCall2 <cycle 2> [19]
[20] 52.4 0.53 30.02 1649 gtsvector_picksplit [20]
29.74 0.00 23519673/33035195 hemdistsign [18]
0.14 0.00 22985469/22985469 hemdistcache [50]
0.12 0.00 268480/10030581 makesign [25]
0.02 0.00 270400/270400 fillcache [85]
0.00 0.00 9894/4077032 AllocSetAlloc [34]
0.00 0.00 9894/2787477 MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:

for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
{
for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
{
if (k == FirstOffsetNumber)
fillcache(&cache[j], GETENTRY(entryvec, j));

size_waste = hemdistcache(&(cache[j]), &(cache[k]));
if (size_waste > waste)
{
waste = size_waste;
seed_1 = k;
seed_2 = j;
}
}
}

I wonder if there is a way to improve on that.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jim C. Nasby 2006-01-20 22:00:40 Re: Page-Level Encryption
Previous Message Rick Gigger 2006-01-20 21:49:12 panic on 7.3

Browse pgsql-performance by date

  From Date Subject
Next Message Jignesh K. Shah 2006-01-20 22:13:58 Re: Sudden slowdown of Pg server
Previous Message Steinar H. Gunderson 2006-01-20 21:44:02 Re: [GENERAL] Creation of tsearch2 index is very slow