
Text search selectivity improvements (was Re: Google Summer of Code 2008)

From: Jan Urbański <j.urbanski@students.mimuw.edu.pl>
To: Oleg Bartunov <oleg@sai.msu.su>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Subject: Text search selectivity improvements (was Re: Google Summer of Code 2008)
Date: 2008-03-19 01:44:09
Message-ID: 47E06FE9.8050703@students.mimuw.edu.pl
Lists: pgsql-hackers
OK, here's a more detailed description of the FTS selectivity 
improvement idea:


=== Write a typanalyze function for column type tsvector ===

The function would go through the tuples returned by the BlockSampler 
and count how many times each distinct lexeme appears inside the 
tsvectors of those tuples. It would then store the most common lexemes, 
along with their frequencies, in pg_statistic.
This will likely require adding a new STATISTIC_KIND_* constant: with 
tsvector statistics we won't store the most common values of the 
tsvector column based on some kind of equality operator, but rather the 
most common lexemes appearing inside those tsvectors.
Each frequency will be the fraction of the total rows that contain the 
particular lexeme in the tsvector column being analyzed.
The most frequent lexemes would be stored as text[] in stavalues1 and 
their frequencies as real[] in stanumbers1.
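
To make this step concrete, here is a rough, uncompiled sketch of the 
shape such a routine could take. STATISTIC_KIND_MCLEXEME is just a 
placeholder name I made up for the new constant, and the lexeme 
counting is elided to comments:

#include "postgres.h"
#include "commands/vacuum.h"
#include "tsearch/ts_type.h"

/* Rough sketch only -- not compiled. */
static void
compute_tsvector_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
                       int samplerows, double totalrows)
{
    int         i;

    /* ... set up a hash table mapping lexeme -> occurrence count ... */

    for (i = 0; i < samplerows; i++)
    {
        bool        isnull;
        Datum       value = fetchfunc(stats, i, &isnull);
        TSVector    vec;
        WordEntry  *entries;
        int         j;

        if (isnull)
            continue;

        vec = DatumGetTSVector(value);
        entries = ARRPTR(vec);
        for (j = 0; j < vec->size; j++)
        {
            /*
             * The lexeme text is STRPTR(vec) + entries[j].pos, with
             * length entries[j].len; bump its counter in the hash.
             */
        }
    }

    /*
     * Sort the hash entries by count, keep the top attstattarget
     * lexemes, and store them:
     *   stats->stakind[0]    = STATISTIC_KIND_MCLEXEME;
     *   stats->stavalues[0]  = the lexemes as a text[]  (stavalues1)
     *   stats->stanumbers[0] = counts / samplerows as a real[] (stanumbers1)
     */
    stats->stats_valid = true;
}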

XXXs:
- Will looking for the most common lexemes in just a few sample rows be 
enough to do useful selectivity estimation? Maybe the minimum number of 
rows returned by the sampler should be raised for this kind of 
stat-gathering. Or maybe it should be documented that it is advisable to 
SET STATISTICS for a tsvector column to something fairly large if you 
want good results from the planner.
- Tables storing indexed documents typically see very few deletes, or 
none at all. This means that when we're regularly sampling rows and 
computing most common lexemes, maybe we shouldn't throw away the 
previous results? Maybe it would be smart to "merge" the previous 
results with the freshly obtained ones? If we assume no deletes were 
made between the ANALYZEs, we could do the math and produce MCV and 
frequency estimates based on that assumption and the previous results 
(see the back-of-envelope example after this list).
- Right now there seems to be some duplicate code in 
compute_minimal_stats and compute_scalar_stats; maybe this could be 
cleaned up as a side effect. The custom typanalyze function would also 
need to estimate the number of nonnull entries and the average width of 
the column; perhaps these could be made into separate functions and 
called from all three places (compute_minimal_stats, 
compute_scalar_stats, tsvector_typanalyze).
- Maybe there are other interesting statistics we could collect for 
tsvectors, something more fancy than just most common lexemes?
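
To make the merging idea concrete (back-of-envelope only; the numbers 
are invented and this glosses over the fact that the two samples cover 
different table states): if the previous ANALYZE sampled n_old = 3000 
rows and saw 'dog' with frequency f_old = 0.20, and the fresh one 
samples n_new = 3000 rows and sees f_new = 0.26, pooling the samples by 
size would give

    f_merged = (n_old * f_old + n_new * f_new) / (n_old + n_new)
             = (3000 * 0.20 + 3000 * 0.26) / 6000
             = 0.23

i.e. a doubled effective sample for free, at the cost of lagging behind 
if the lexeme distribution drifts over time.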

=== Write a selectivity estimation function for the @@ operator ===

The function would look at the tsquery and the statistics gathered by 
the function described earlier, and return a selectivity estimate based 
on them.

For example, given
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('dog')
if the lexeme 'dog' appears among the most common lexemes stored for 
doc_vector with a frequency of 0.7, we would return a selectivity 
estimate of 0.7.

Of course this is a very simple example; in reality it will be harder 
than this. First, the function would have to walk the tsquery and take 
its structure into consideration. For example
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('!dog')
would have to return an estimate of 0.3 (or something more subtle than 
just 1 - 0.7?). The same goes for the other operators, & and |.

If no lexemes from the tsquery are among the stored most common 
lexemes, we return an arbitrary 0.001, as is currently done for all @@ 
queries.
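
The structural walk could then be a simple recursion over the tsquery's 
QueryItem array, along the lines of the uncompiled sketch below. The 
operator/operand field names are from memory, and lexeme_selec() is an 
invented helper that would probe the most-common-lexemes statistics and 
fall back to the 0.001 default:

#include "postgres.h"
#include "nodes/nodes.h"
#include "tsearch/ts_type.h"

/* invented helper: look the lexeme up in the gathered MCV list */
static Selectivity lexeme_selec(const char *lexeme, int len);

/* Rough sketch only -- not compiled. */
static Selectivity
tsquery_selec(QueryItem *item, char *operand)
{
    Selectivity s1, s2;

    if (item->type == QI_VAL)           /* a plain lexeme */
        return lexeme_selec(operand + item->qoperand.distance,
                            item->qoperand.length);

    switch (item->qoperator.oper)
    {
        case OP_NOT:
            return 1.0 - tsquery_selec(item + 1, operand);

        case OP_AND:                    /* naive independence assumption */
            s1 = tsquery_selec(item + 1, operand);
            s2 = tsquery_selec(item + item->qoperator.left, operand);
            return s1 * s2;

        case OP_OR:                     /* inclusion-exclusion */
            s1 = tsquery_selec(item + 1, operand);
            s2 = tsquery_selec(item + item->qoperator.left, operand);
            return s1 + s2 - s1 * s2;

        default:
            return 0.001;               /* shouldn't happen */
    }
}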

=== Deploy these functions ===

This could at first be deployed as a contrib module that would define 
tsvector_typanalyze (or maybe ts_typanalyze, to be consistent with the 
other ts_* functions) and tsvectorsel, and would update pg_operator and 
pg_type so that tsvector columns get ANALYZEd and @@ restricted with 
the new methods.
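
Untested, and using the provisional function names from above, the 
catalog updates could look roughly like:

UPDATE pg_type
   SET typanalyze = 'ts_typanalyze'::regproc
 WHERE typname = 'tsvector';

UPDATE pg_operator
   SET oprrest = 'tsvectorsel'::regproc
 WHERE oprname = '@@'
   AND (oprleft = 'tsvector'::regtype OR oprright = 'tsvector'::regtype);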

So much for the idea, but I might very well have missed some crucial 
things that'd have to be done in order to pull this off. Comments, 
suggestions, criticism?

Regards,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
