Re: gsoc, text search selectivity and dllist enhancements

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancements
Date: 2008-07-08 22:33:48
Message-ID: 4873EB4C.3070404@students.mimuw.edu.pl
Lists: pgsql-hackers

Jan Urbański wrote:
> If you think the Lossy Counting method has potential, I could test it
> somehow. Using my current work I could extract a stream of lexemes as
> ANALYZE sees it and run it through a python implementation of the
> algorithm to see if the result makes sense.

I hacked together a simplistic Python implementation and ran it on a
table with 244901 tsvectors, 45624891 lexemes total. I compared the
results from my current approach with the results I'd get from the
Lossy Counting algorithm (sketched below).

I experimented with statistics_target set to 10 and 100, and ran pruning
in the LC algorithm every 3, 10 or 100 tsvectors. The sample size with
statistics_target set to 100 was 30000 rows, and that's what the input
to the script was: the lexemes from these 30000 tsvectors.

I found that with pruning happening every 10 tsvectors I got precisely
the same results as with the original algorithm (same most common
lexemes, same frequencies). When I pruned after every 100 tsvectors the
results changed very slightly (they were a tiny bit further from the
original algorithm's, and I think a tiny bit more precise, but I didn't
look into that closely).
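
For reference, here is a minimal sketch of textbook Lossy Counting
(Manku & Motwani, VLDB 2002). It is not the exact script I ran: my
script pruned after every fixed number of tsvectors rather than at the
epsilon-derived bucket boundaries, and the names below are made up for
illustration.

import math

def lossy_count(stream, epsilon=0.001):
    # Returns (counts, n): counts maps item -> lower-bound count.
    # Every item whose true frequency exceeds epsilon * n is
    # guaranteed to survive, and no surviving count undercounts
    # the true count by more than epsilon * n.
    w = int(math.ceil(1.0 / epsilon))   # bucket width
    counts = {}                         # item -> [count, max_error]
    n = 0                               # items seen so far
    for item in stream:
        n += 1
        bucket = (n + w - 1) // w       # current bucket id, ceil(n/w)
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % w == 0:                  # prune at each bucket boundary
            doomed = [k for k, (c, d) in counts.items() if c + d <= bucket]
            for k in doomed:
                del counts[k]
    return dict((k, c) for k, (c, d) in counts.items()), n

To get the most common lexemes you'd then sort the surviving entries by
count and keep the top statistics_target of them, e.g.:

counts, n = lossy_count(lexeme_stream)
most_common = sorted(counts, key=counts.get, reverse=True)[:100]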

The bottom line seems to be: the Lossy Counting algorithm gives roughly
the same results as the algorithm used currently, is possibly faster,
and scales better with respect to statistics_target.

This should probably get more testing than running one script five times
over a fixed set of data, but I already had trouble pulling ~300 MB of
tsvectors from one of my production sites and getting it onto my laptop.

Do you think it's worthwhile to implement the LC algorithm in C and send
it out, so others could try it? Heck, maybe it's even worthwhile to
replace the current compute_minimal_stats() algorithm with LC and see
how that compares?

Anyway, I can share the Python script if someone would like to do more
tests (I suspect no one will, because you'd first need to apply my
ts_typanalyze patch and then modify it further to extract lexemes from
the sample).

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
