Re: tsvector pg_stats seems quite a bit off.

From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 07:08:32
Message-ID: 4C020EF0.40007@krogh.cc
Lists: pgsql-hackers

On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>>> probably be stopwords, so we will never see them in the input.
>>>
>>>
>> I think you should skip the assumption about stop words; users may
>> use something where they are needed in the index, or have a language
>> other than the typical one (and they don't seem to influence the math
>> that much).
>>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet, results below.
>
>
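Just to make those numbers concrete, here is my own back-of-the-envelope
check, plugging in the values quoted above (statistics target 100, top 10
words assumed to be stop words, H(W) ~ 6.5 for ~1e6 English words):

```sql
-- Minimum frequency a lexeme would need to survive the final pruning,
-- under the assumptions quoted above (st = 100, 10 stopwords, H(W) ~ 6.5):
SELECT 1.0 / ((100 + 10) * 6.5) AS min_frequency;
-- roughly 0.0014, i.e. about one occurrence per 715 documents
```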
How about setting it to "some default" in the first analyze round, but
setting it to the count of MCEs with a frequency of 1 in the subsequent
analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting statistics
>> for "array-types", say integer-arrays and text arrays?
>>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).
>

In another area, I was thinking about modelling a complete tree structure
where I would like to extract complete sub-branches as int[] of the
node-ids in the set and then index them using GIN. That seems like a
"really bad idea" based on the above information.

Wouldn't it make sense to treat array types like the tsvectors?
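For what it's worth, the schema I had in mind is roughly this (table and
column names made up for illustration):

```sql
-- Each row stores the node-ids of one complete sub-branch as int[],
-- indexed with GIN so containment queries can find branches by member node.
CREATE TABLE tree_branch (
    branch_id serial PRIMARY KEY,
    node_ids  integer[] NOT NULL
);
CREATE INDEX tree_branch_nodes_idx ON tree_branch USING gin (node_ids);

-- Find every stored branch that contains node 42:
SELECT branch_id FROM tree_branch WHERE node_ids @> ARRAY[42];
```

The index itself works fine; my worry, per the above, is that the planner
would have no per-element statistics for node_ids to estimate such queries.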

> The results are attached in a text (CSV) file, to preserve formatting.
> Based on them I'd like to propose top_stopwords and error_factor to be 100.
>

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have something
that is robust to that.

There are two reasons for that wish. One is that the application becomes
more general: I really cannot stop my users from searching for stop
words, and they would expect the "full set" rather than the "empty set"
we get now.

The list of stop words is by no means finite and would very much depend
on the input data set.

I would try to add the stop words to the dictionary so they still work,
but don't occupy that much space in the actual index. That seems to
solve the same task with fewer issues for the users and more generalized
code around it.
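One possible way to do that (a sketch only; the configuration name is made
up) is a Snowball dictionary created without a StopWords file, so stemming
is kept but nothing is silently discarded:

```sql
-- No STOPWORDS parameter, so stop words are stemmed and indexed
-- instead of being thrown away:
CREATE TEXT SEARCH DICTIONARY english_nostop (
    TEMPLATE = snowball,
    LANGUAGE = english
);

CREATE TEXT SEARCH CONFIGURATION english_nostop ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION english_nostop
    ALTER MAPPING FOR asciiword, word WITH english_nostop;
```

With that configuration, plainto_tsquery('where') would no longer collapse
to an empty tsquery, at the cost of a larger index.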

>> I can "fairly easy" try out patches or do other kind of testing.
>>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.
>

Excellent.

testdb=# explain select id from testdb.reference where document_tsvector
@@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@
plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id; they should
be trivial to extract.

--
Jesper
