
Re: gsoc, oprrest function for text search take 2

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To:
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, oprrest function for text search take 2
Date: 2008-09-19 16:05:36
Message-ID: 48D3CDD0.9090105@students.mimuw.edu.pl
Lists: pgsql-hackers
ju219721(at)students(dot)mimuw(dot)edu(dot)pl wrote:
> Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> 
>> I wrote:
>>> ...  One possibly
>>> performance-relevant point is to use DatumGetTextPP for detoasting;
>>> you've already paid the costs by using VARDATA_ANY etc, so you might
>>> as well get the benefit.
>>
>> Actually, wait a second.  That code doesn't work at all on toasted data,
>> because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
>> That would give you the physical datum size (eg the size of the toast
>> pointer), not the number you need.
>>
>> However, this is actually not a problem because we know that the data
>> came from an array in pg_statistic, which means the individual members
>> *can't be toasted*.  At least they can't be compressed or out-of-line.
>> We'd do that at the array level, it's not sensible to do it on an
>> individual array member.
>>
>> I think that right at the moment the array stuff doesn't permit short
>> headers either, but it would make sense to relax that someday.  So I'd
>> recommend that your code allow either regular or short headers, but not
>> worry about compression or out-of-line storage.
>>
>> Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
>> forget the "detoasting" step.  Maybe put in
>>     Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum))
>> instead.

Well whaddya know. It turned out that my new company has a 
'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a 
full day to work on the patch.
Attached is a version that stores the minimum and maximum frequencies in
the Numbers array, includes the aforementioned assertion, and orders the
functions in ts_selfuncs.c more sensibly.

I tested it with oprofile and
    pgbench -n -f tssel-bench.sql -t 1000 postgres
with tssel-bench.sql containing
    select * from manuals where tsvector @@ to_tsquery('foo');

"manuals" has ~700 rows and 'foo' does not appear in any of the lexemes.

The results are:
=== CVS HEAD ===
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.399584 (including connections establishing)
tps = 13.399972 (excluding connections establishing)

samples  %        symbol name
74069    34.7779  pglz_decompress
38560    18.1052  tsvectorout
7688      3.6098  pg_mblen
5366      2.5195  hash_search_with_hash_value
4833      2.2693  pg_utf_mblen
4718      2.2153  AllocSetAlloc
4041      1.8974  index_getnext
3100      1.4556  LWLockAcquire
3056      1.4349  hash_any
2843      1.3349  LWLockRelease
2611      1.2260  AllocSetFree
2126      0.9982  tsCompareString
2121      0.9959  _bt_compare
1830      0.8592  LockAcquire
1517      0.7123  toast_fetch_datum
1503      0.7057  .plt
1338      0.6282  _bt_checkkeys
1332      0.6254  FunctionCall2
1233      0.5789  ReadBuffer_common
1185      0.5564  slot_deform_tuple
1157      0.5433  TParserGet
1123      0.5273  LockRelease


=== PATCH ===
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.309346 (including connections establishing)
tps = 13.309761 (excluding connections establishing)

samples  %        symbol name
171514   35.0802  pglz_decompress
87231    17.8416  tsvectorout
17107     3.4989  pg_mblen
12514     2.5595  hash_search_with_hash_value
11124     2.2752  pg_utf_mblen
10739     2.1965  AllocSetAlloc
8534      1.7455  index_getnext
7460      1.5258  LWLockAcquire
6876      1.4064  LWLockRelease
6622      1.3544  hash_any
5773      1.1808  AllocSetFree
5210      1.0656  _bt_compare
4849      0.9918  tsCompareString
4043      0.8269  LockAcquire
3535      0.7230  .plt
3246      0.6639  _bt_checkkeys
3170      0.6484  toast_fetch_datum
3057      0.6253  FunctionCall2
2815      0.5758  ReadBuffer_common
2767      0.5659  TParserGet
2605      0.5328  slot_deform_tuple
2567      0.5250  MemoryContextAlloc

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment: tssel-oprrest-presorted.diff
Description: text/plain (21.5 KB)
