Re: gsoc, oprrest function for text search take 2

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, oprrest function for text search take 2
Date: 2008-08-14 11:02:15
Message-ID: 48A410B7.3020004@students.mimuw.edu.pl
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Jan Urbański wrote:
>> So right now the idea is to:
>> (1) pre-sort STATISTIC_KIND_MCELEM values
>> (2) build an array of pointers to detoasted values in tssel()
>> (3) use binary search when looking for MCELEMs during tsquery analysis
>
> Sounds like a plan. In (2), it's even better to detoast the values
> lazily. For a typical one-word tsquery, the binary search will only look
> at a small portion of the elements.

Hm, how can I do that? TOAST is still a bit of black magic to me... Do you
mean I should stick to having Datums in TextFreq, and use DatumGetTextP
in bsearch() (assuming I get rid of qsort())? I wanted to avoid that so
that I wouldn't detoast the same value multiple times, but it's true: a
binary search won't touch most elements.
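
A minimal standalone sketch of the lazy approach discussed above, written without any PostgreSQL headers so it can be compiled on its own. Everything in it is a hypothetical stand-in: `LazyLexeme` plays the role of a TextFreq entry holding a still-toasted Datum, `lexeme_get()` stands in for DatumGetTextP (here it only copies a string), and `detoast_calls` exists purely to show how few entries get touched. It assumes the entries are already sorted by lexeme, and it caches each decoded value so that no entry is decoded twice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an array element holding a toasted value. */
typedef struct LazyLexeme
{
    const char *raw;        /* stored form (pretend it is compressed) */
    char       *detoasted;  /* NULL until the entry is first compared */
} LazyLexeme;

/* Counts how many entries were actually decoded. */
int detoast_calls = 0;

/*
 * Return the decoded lexeme, decoding it on first use only.
 * Stand-in for DatumGetTextP(); the "decoding" here is a plain copy.
 */
const char *
lexeme_get(LazyLexeme *e)
{
    assert(e->raw != NULL);
    if (e->detoasted == NULL)
    {
        size_t len = strlen(e->raw) + 1;

        e->detoasted = malloc(len);
        if (e->detoasted == NULL)
            abort();
        memcpy(e->detoasted, e->raw, len);
        detoast_calls++;
    }
    return e->detoasted;
}

/*
 * Binary search over entries pre-sorted by lexeme.
 * Returns the index of the match, or -1 if the key is absent.
 * Only the entries probed by the search are decoded.
 */
int
find_lexeme(LazyLexeme *entries, int n, const char *key)
{
    int lo = 0;
    int hi = n - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, lexeme_get(&entries[mid]));

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With n entries, a single lookup decodes at most about log2(n) + 1 of them, and the cached pointer means a later lookup that probes the same entry pays nothing extra. That covers both concerns: no repeated detoasting, and no detoasting of elements the search never visits.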

> Another thing is, how significant is the time spent in tssel() anyway,
> compared to actually running the query? You ran pgbench on EXPLAIN,
> which is good to see where in tssel() the time is spent, but if the time
> spent in tssel() is say 1% of the total execution time, there's no point
> optimizing it further.

Changed the pgbench script to
  select * from manual where tsvector @@ to_tsquery('foo');
and the parameters to
  pgbench -n -f tssel-bench.sql -t 1000 postgres

and got

number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 12.238282 (including connections establishing)
tps = 12.238606 (excluding connections establishing)

samples  %        symbol name
174731   31.6200  pglz_decompress
88105    15.9438  tsvectorout
17280     3.1271  pg_mblen
13623     2.4653  AllocSetAlloc
13059     2.3632  hash_search_with_hash_value
10845     1.9626  pg_utf_mblen
10335     1.8703  internal_text_pattern_compare
9196      1.6641  index_getnext
9102      1.6471  bttext_pattern_cmp
8075      1.4613  pg_detoast_datum_packed
7437      1.3458  LWLockAcquire
7066      1.2787  hash_any
6811      1.2325  AllocSetFree
6623      1.1985  pg_qsort
6439      1.1652  LWLockRelease
5793      1.0483  DirectFunctionCall2
5322      0.9631  _bt_compare
4664      0.8440  tsCompareString
4636      0.8389  .plt
4539      0.8214  compare_two_textfreqs

But I think I'll go with pre-sorting anyway; it feels cleaner and neater.
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
