From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, text search selectivity and dllist enhancments
Date: 2008-07-10 21:02:36
Message-ID: 14238.1215723756@sss.pgh.pa.us

Lists: pgsql-hackers

Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:

> Still, there's a decision to be made: after how many lexemes should the
> pruning occur?

The way I think it ought to work is that the number of lexemes stored in
the final pg_statistic entry is statistics_target times a constant
(perhaps 100). I don't like having it vary depending on tsvector width
--- why for example should a column having a few wide tsvectors get a
bigger stats entry than one with many narrow ones? (Not to mention the
issue of having to estimate the average or max width before you can
start the counting run.)

But in any case, given a target number of lexemes to accumulate,
I'd suggest pruning with that number as the bucket width (pruning
distance). Or perhaps use some multiple of the target number, but
the number itself seems about right. The LC paper says that the
bucket width w is equal to ceil(1/e) where e is the maximum frequency
estimation error, and that the maximum number of table entries needed
is log(eN)/e after N lexemes have been scanned. For the values of e
and N we are going to be dealing with, this is likely to work out to
a few times 1/e, in other words the table size is a few times w.
(They prove it's at most 7w given reasonable assumptions about data
distribution, regardless of how big N gets; though I think our values
for N aren't large enough for that to matter.)
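[Editor's note: the pruning scheme being discussed is the Lossy Counting (LC) algorithm of Manku and Motwani. A minimal sketch in Python, assuming a plain dict as the frequency table and a generic item stream; the function and variable names here are illustrative, not taken from the actual typanalyze code:]

```python
import math

def lossy_count(stream, e):
    """Approximate frequency counting with error at most e*N.

    stream: iterable of hashable items (e.g. lexemes)
    e: maximum frequency-estimation error
    Returns a dict of item -> (under)estimated count.
    """
    w = math.ceil(1 / e)        # bucket width == pruning distance
    counts = {}                 # item -> (count, bucket_id at insertion - 1)
    n = 0                       # items seen so far
    bucket = 1                  # current bucket id
    for item in stream:
        n += 1
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            # delta bounds how many occurrences we may have missed
            counts[item] = (1, bucket - 1)
        if n % w == 0:
            # end of bucket: prune entries whose count plus slack
            # cannot exceed the current bucket id
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
            bucket += 1
    return {k: f for k, (f, d) in counts.items()}
```

With w = ceil(1/e), an item whose true frequency exceeds e*N is guaranteed to survive pruning, which is what makes w itself a sensible pruning distance. For instance, with e = 0.001 and N = 10^6, log(eN)/e = ln(1000) * 1000, roughly 6.9/e, consistent with the "a few times w" estimate above.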

The existing compute_minimal_stats code uses a table size of twice the
target number of values, so setting w to maybe a half or a third of the
target number would reproduce the current space usage. I don't see a
problem with letting it get a little bigger though, especially since we
can expect that the lexemes aren't very long. (compute_minimal_stats
can't assume that for arbitrary data types...)

regards, tom lane

- Re: gsoc, text search selectivity and dllist enhancments at 2008-07-10 20:32:26 from Jan Urbański

- Re: gsoc, text search selectivity and dllist enhancments at 2008-07-10 21:26:35 from Jan Urbański
